Abstract

A distributed XML document is an XML document that spans several machines.
We assume that a distribution design of the document tree is given, consisting of
an XML kernel-document $T[f_1,\ldots,f_n]$ where some leaves are "docking points" for
external resources providing XML subtrees ($f_1, \ldots, f_n$, standing, e.g., for Web
services or peers at remote locations). The top-down design problem consists
in, given a type (a schema document that may vary from a DTD to a tree
automaton) for the distributed document, "propagating" locally this type into a
collection of types, that we call a typing, while preserving desirable properties.
We also consider the bottom-up design which consists in, given a type for each
external resource, exhibiting a global type that is enforced by the local types,
again with natural desirable properties. In the article, we lay out the
fundamentals of a theory of distributed XML design, analyze problems concerning
typing issues in this setting, and study their complexity.

Keywords: Semistructured Data, XML Schemas, Distributed Data, Database
Design, Distributed XML



1. Introduction 

Context and Motivation. With the Web, information tends to be more and more
distributed. In particular, the distribution of XML data is essential in many
areas such as e-commerce (shared product catalogs), collaborative editing (e.g.,
based on WebDAV [IH]), or network directories. (See also the W3C XML
Fragment Interchange Working Group [3l].) It often becomes cumbersome to
verify the validity, e.g., the type, of such a hierarchical structure spanning several
machines. In this paper, we consider typing issues raised by the distribution of
XML documents. We introduce "nice" properties that the distribution should
obey to facilitate type verification based on locality conditions. We propose an
automata-based study of the problem. Our theoretical investigation provides a
starting point for the distributed validation of tree documents (verification) and
for selecting a distribution for Web data (design). In general, it provides new
insights into the typing of XML documents.

A distributed XML document $T[t_1..t_n]$ is given by an XML kernel-document
$T[f_1,\ldots,f_n]$ that is stored locally at some site, and some of whose leaves (the docking
points) refer to external resources, here denoted by $f_1, \ldots, f_n$, that provide the
additional XML data $t_1, \ldots, t_n$ to be attached, respectively, to $T$. For simplicity,
each node playing the role of docking point is called a function-node and it is
labeled with the resource that it refers to.

The extension $\mathit{ext}_T(t_1..t_n)$ of $T$ is the whole XML document obtainable from
the distributed document $T[t_1..t_n]$ by replacing the node referring to resource $f_i$ with
the forest of XML trees (in left-to-right order) directly connected to the root of
$t_i$, for each $i$ in $\{1, \ldots, n\}$.




Figure 1: A distributed XML document for the National Consumer Price Index where both
its kernel and some of its remote XML (sub)documents have been highlighted.

Figure 1 shows a (drastically simplified) possible distributed XML document
for the National Consumer Price Index (NCPI) maintained by Eurostat.
This example is detailed further in this section.

Typically, a global designer first chooses a specific language for constraining
the documents of interest. The focus in this paper is on "structural constraints".
Clearly, one could also consider other constraints such as key and referential
constraints. So, say the designer has to specify documents using DTDs. Then
he specifies a kernel document $T[f_1,\ldots,f_n]$ together with either:

bottom-up design: types $\tau_i$ for each $f_i$;
top-down design: a global type $\tau$.
In the bottom-up case, we are interested in the global type that results
from each local source enforcing its local type. Can such typing be described by
specific type languages?

In the top-down case, we would like the extension of $T$ to satisfy $\tau$. The issue
is "Is it possible to enforce it using only local control?" In particular, we would
like to break down $\tau$ into local types that could be enforced locally. More
precisely, we would like to provide each $f_i$ with a typing guaranteeing that (i)
if each $t_i$ verifies its type $\tau_i$, then the global type is verified (soundness), and (ii)
the typing $\tau_1..\tau_n$ is not more restrictive than the global type (completeness).
We call such a typing a local typing. We study both (maximal) local typings and
an even more desirable notion, namely "perfect typings" (to be defined).
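
Soundness and completeness can be illustrated at the level of a single content model. The following Python sketch is only a bounded sanity check, not a decision procedure: it abstracts element names as single characters (`v` for averages, `A` and `B` for two national formats, all hypothetical encodings), represents a kernel string as a list mixing fixed words and indices of docking points, and compares the composed language with the global one on words up to a chosen length.

```python
import re
from itertools import product

def words(alphabet, max_len):
    """Enumerate all words over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for w in product(alphabet, repeat=n):
            yield "".join(w)

def language(regex, alphabet, max_len):
    """Bounded fragment of the language of `regex`."""
    pat = re.compile(regex)
    return {w for w in words(alphabet, max_len) if pat.fullmatch(w)}

def check(global_re, kernel, typing, alphabet, max_len):
    """Return (sound, complete) for a candidate typing, up to `max_len`.

    kernel: list whose items are fixed strings or integer indices into `typing`.
    typing: list of regexes, one per docking point (root content models).
    """
    locs = [language(r, alphabet, max_len) for r in typing]
    parts = [locs[k] if isinstance(k, int) else {k} for k in kernel]
    composed = {"".join(p) for p in product(*parts)}
    glob = re.compile(global_re)
    # Soundness: every composed word is accepted by the global model.
    sound = all(glob.fullmatch(w) is not None for w in composed)
    # Completeness (bounded): every short global word can be composed.
    complete = language(global_re, alphabet, max_len) <= composed
    return sound, complete

kernel = ["v", 0, 1]  # averages followed by two docking points
print(check("v(A|B)*", kernel, ["(A|B)*", "(A|B)*"], "vAB", 3))
print(check("v(A*|B*)", kernel, ["(A*|B*)", "(A*|B*)"], "vAB", 3))
print(check("v(A*|B*)", kernel, ["A*", "A*"], "vAB", 3))
```

Under these assumptions, the first typing passes both tests, while for the model `v(A*|B*)` one candidate fails soundness and the other fails completeness, which matches the intuition that a choice shared by all resources cannot be controlled locally.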

To conclude this introduction, we next detail the Eurostat example. We then
present a formal overview of the paper (which may be skipped on a first reading).
Finally, we survey related works.

Working Example. Before mentioning some related works and concluding this
section, we further illustrate these concepts by detailing our Eurostat example.

The NCPI is a document containing consumer price data for each EC country.
We assume that the national data are maintained in local XML repositories
by each country's national statistics bureau (INSEE for France, Statistik for
Austria, Istat for Italy, UK Statistics Authority, and so on). Each national data
set is under the strict control of its respective statistics bureau. The kernel
document $T_0$ is maintained by Eurostat in Luxembourg and has a docking point
for each resource located in a particular country. In addition, $T_0$ contains
average data for the entire EU zone. Figure 2 shows a possible extension of $T_0$,
where the actual data values are omitted.




Figure 2: The extension of a possible distributed document having kernel $T_0$, and complying
with the whole structure shown in Figure 1.

We first assume that Eurostat specifies the global type $\tau$ for the distributed
NCPI document, where $\tau$ is given by the DTD document shown in Figure 3. (In
the following, we adopt a more succinct notation for types where the content
model of an element name is either left undefined if it is solely "#PCDATA", or
defined by a rule of the form "index → value, year", otherwise.) Briefly, DTD
$\tau$ requires that each possible extension $\mathit{ext}_{T_0}(t_1..t_n)$ consists of a subtree
containing average data for Goods (such as food, energy, education, and so on).
Each Good item is evaluated in different years by means of an index. Moreover,
$\mathit{ext}_{T_0}(t_1..t_n)$ may contain a forest of nationalIndex elements, namely indexes
associated to goods in precise countries.



<!ELEMENT eurostat (averages, nationalIndex*)>
<!ELEMENT averages (Good, index+)+>
<!ELEMENT nationalIndex (country, Good, (index | (value, year)))>
<!ELEMENT index (value, year)>
<!ELEMENT country (#PCDATA)>
<!ELEMENT Good (#PCDATA)>
<!ELEMENT value (#PCDATA)>
<!ELEMENT year (#PCDATA)>



Figure 3: W3C DTD $\tau$

To comply with different national databases, two different formats are
allowed: (country, Good, index) or (country, Good, value, year). It is easy to see
that the pair $(\tau, T_0)$ allows a local typing (see Figure 4) that is even perfect (so, it
can be obtained by the algorithm shown in Section [B]), as we will clarify in the
next section.



root_i → nationalIndex*
nationalIndex → country, Good, (index | value, year)
index → value, year



Figure 4: Type $\tau_i$ ($1 \le i \le n$) in the perfect typing for the top-down design $(\tau, T_0)$

Suppose now that a designer defined instead the DTD $\tau'$ shown in Figure 5
as global type. The pair $(\tau', T_0)$ would be a bad design since $\tau'$ forces all
countries to adopt the same format for their indexes (natIndA or natIndB). But
this represents a constraint that cannot be controlled locally. Indeed, this new
design does not admit any local typing. The nice locality properties of designs
are obvious in such simplistic examples. However, when dealing with a large
number of peers with very different desires and complex documents, the problem
rapidly starts defeating human expertise. Consider, for instance, the type $\tau''$



eurostat → averages, (natIndA* | natIndB*)
averages → (Good, index+)+
natIndA → country, Good, index
natIndB → country, Good, value, year
index → value, year



Figure 5: Type $\tau'$

defined in Figure 6 and the kernel $T_1 = eurostat(f_1, nationalIndex(f_2), f_3)$
containing only three function calls. Even if this design is as small as $(\tau, T_0)$, it
already starts to become hard to manage with no automatic technique. Here,
natIndA and natIndB are different specializations of nationalIndex elements
(note that, as detailed in Section 2.2, this feature requires schema languages
more expressive than DTDs), while all other elements have no specialization.



eurostat → averages, (natIndA, natIndB)
averages → (Good, index+)+
natIndA → country, Good, index
natIndB → country, Good, value, year
index → value, year



Figure 6: Type $\tau''$

In this case, it is not as easy as before to state that the new design has no
perfect typing and exactly the two maximal local typings shown below (only the
content models of the roots are specified). This is mainly because the functions
in $T_1$ occur at different depths, but also due to specializations.

root_1 → averages, (natIndA, natIndB)*
root_2 → country, Good, index
root_3 → country, Good, value, year, (natIndA, natIndB)*

root_1 → averages, (natIndA, natIndB)*, natIndA
root_2 → country, Good, value, year
root_3 → (natIndA, natIndB)*

The techniques developed in this paper are meant to support experts in 
designing such distributed document schemas. 

Overview of Results. We next make precise the formal setting of the paper and its
results. From a formal viewpoint, we use Active XML terminology and notation
for describing distributed documents Q.

Not surprisingly, our results depend heavily on the nature of the typing that
is considered. For types, we consider abstract versions of the conventional typing
languages d, IM IM IIH, namely $\mathcal{R}$-DTDs (for W3C DTDs), $\mathcal{R}$-SDTDs (for W3C
XSD), and $\mathcal{R}$-EDTDs (for regular tree grammars such as Relax-NG) where $\mathcal{R}$
(varying among nFAs, dFAs, nREs, and dREs, namely automata and regular
expressions both nondeterministic and deterministic) denotes the formalism for
specifying content models.

As a main contribution, we initiate a theory of local typing. We introduce
and study three main notions of locality: local typing, maximal local typing,
and perfect typing. For a given XML schema language $\mathcal{S}$, we study the following
verification problems:

• Given an $\mathcal{S}$-typing for a top-down $\mathcal{S}$-design, determine whether the former
is local, maximal local, or perfect. We call these problems LOC[$\mathcal{S}$], ML[$\mathcal{S}$] and
PERF[$\mathcal{S}$], respectively;

• Given a top-down $\mathcal{S}$-design, establish whether a local, maximal local, or
perfect $\mathcal{S}$-typing does exist (and, of course, find them). We call these problems
$\exists$-LOC[$\mathcal{S}$], $\exists$-ML[$\mathcal{S}$], and $\exists$-PERF[$\mathcal{S}$], respectively;

• Given a bottom-up $\mathcal{S}$-design, establish whether it defines an $\mathcal{S}$-type. The
problem is called CONS[$\mathcal{S}$].

The analysis carried out in this paper provides tight complexity bounds for
some of these problems. In particular, for bottom-up designs, we prove that
CONS[$\mathcal{S}$] is:

• decidable in constant time for $\mathcal{R}$-EDTDs, for each $\mathcal{R}$;

• PSPACE-complete both for $\mathcal{R}$-DTDs and $\mathcal{R}$-SDTDs, in general;

• PSPACE-hard with an EXPTIME upper bound for dRE-DTDs and dRE-SDTDs.

For top-down designs, after showing that the problems for trees can be reduced
to problems on words, we specialize the analysis to the case of $\mathcal{R}$ = nFA. In
particular:

• LOC[$\mathcal{S}$], ML[$\mathcal{S}$], PERF[$\mathcal{S}$], and $\exists$-PERF[$\mathcal{S}$] are PSPACE-complete when $\mathcal{S}$ stands
for nFA-DTD or nFA-SDTD, and LOC[$\mathcal{S}$] is EXPTIME-complete for nFA-EDTDs;

• $\exists$-LOC[$\mathcal{S}$] and $\exists$-ML[$\mathcal{S}$] are PSPACE-hard with an EXPSPACE upper bound
when $\mathcal{S}$ stands for nFA-DTD or nFA-SDTD;

• the remaining problems are EXPTIME-hard with either coNEXPTIME or
2-EXPSPACE upper bounds.

Related Work. Distributed data design has been studied quite in depth, in
particular for relational databases Il3l |34|. Some previous works have considered
the design of Web applications [l^. They lead to the design of Web sites. The
design there is guided by an underlying process. It leads to a more dynamic
notion of typing, where part of the content evolves in time, e.g., creating a
cart for a customer. For obvious reasons, distributed XML has attracted a lot of
attention recently. Most works focused on query optimization. The
few that consider design typically assume no ordering or only a limited one Q.
This last work would usefully complement the techniques presented here. Also,
works on relational database and LDAP* design focus on unordered collections.
Even the W3C goes in this direction with a working group on XML Fragment
Interchange. The goal is to be able to process (e.g., edit) document fragments
independently. Quoting the W3C Candidate Recommendation: "It may
be desirable to view or edit one or more [fragments] while having no interest,
need, or ability to view or edit the entire document." This is clearly related
to the problem we study here. Finally, the concept of distributed documents,
as defined in this paper, is already implemented in Active XML, a declarative
framework that harnesses web services for data integration, and is put to work
in a peer-to-peer architecture [Hl^-. Moreover, XML documents, XML schemas,
and formal languages have been extensively studied and, although all the
problems treated in this paper are essentially novel**, the theoretical analysis has
benefited from a number of existing works. Classical results about formal (string
and regular) languages come from [20, 23, 25, 26, 32, 36, 37], and in particular,
those about state complexity of these languages can be found in (22l. lisj,
those about one-unambiguous regular languages in 0, HH, and those about
alternating finite state machines in [17, 42]. Finally, regarding XML documents
and schemas, our abstract presentation builds on ten years of research in this
field. In particular, it has been strongly influenced by document typings studied
in H H [ll [3 [11 m [sl, [Hil and results on them presented there.

* Lightweight Directory Access Protocol (LDAP) is a set of open protocols used to access
centrally stored information over a network.

** Consider that, as highlighted in [31], an interesting problem in Formal Language Theory
open for more than ten years, named Language Primality, is essentially a special case of our
problem $\exists$-LOC[dFA]. The complexity of Primality has also been settled in [31].

Structure of the Paper. This concludes the introduction. The remainder of the
paper is organized as follows. Section 2 fixes some preliminary notation, formally
introduces our notions of type and distributed XML document, and defines the
decision problems studied. It also provides an overview of the results. Section 3
considers the bottom-up design. Section 4 presents basic results regarding the
top-down design. Sections 5 and 6 present the main results for the word case.
Section 7 completes the complexity analysis. Section 8 concludes and mentions
possible areas for further research.



2. General setting 

In this paper, we use a widespread abstraction of XML documents and XML
Schemas focusing on document structure [33, 35], and Active XML
terminology and notation for describing distributed documents [H, [H-. In particular,
for XML Schemas we will consider families of tree grammars (called $\mathcal{R}$-DTDs,
$\mathcal{R}$-SDTDs, and $\mathcal{R}$-EDTDs) each of which allows different formalisms for
specifying content models ($\mathcal{R}$ may vary among nFAs, dFAs, nREs, and dREs,
respectively, nondeterministic automata, deterministic automata, regular expressions,
and deterministic regular expressions). This could be surprising at first sight
because the W3C standards impose stricter limitations. However, as we will
informally motivate later (and as has been formally proved in [31]), some of the
problems we define and analyze here have the same complexity independently
of whether we use deterministic or nondeterministic string-automata, or even
deterministic regular expressions. Informally, it can be observed that the
document distribution often erases the benefits of determinism. For this reason, and
because this paper intends to be a first fundamental study of XML distribution,
we include in our analysis different possibilities for schema languages, even if for
some problems we only analyze the most general case ($\mathcal{R}$ is set to nFAs) in order
to delimit their complexity. Moreover, the typing problems we study hint at the
possibility that there could be interesting real world applications (all distributed
applications that involve the management of distributed data, such as data
integration from databases and other data resources exported as Web services, or
managing active views on top of data sources) where W3C recommendations
are too strict and thus unsuitable in the context of distributed XML documents.

2.1. Preliminaries

In this paper, we also use the following notation. We always denote, by $\Sigma$,
a (finite) alphabet; by $\varepsilon$, the empty string; by $\emptyset$, the empty language; by $\cdot$, the
binary operation of concatenation on $\Sigma^*$ and by $\circ$, its extension on $2^{\Sigma^*}$; by $A$, an
automaton for defining a string-language or tree-language over $\Sigma$; by $r$, a regular
expression over $\Sigma$; by $\mathcal{R}$, a formalism for defining string languages; by $\mathcal{S}$, a
formalism for defining tree languages; by $\tau$, an $\mathcal{R}$-type or an $\mathcal{S}$-type (a concrete
formal structure defining, respectively, a string language or a tree language,
such as a regular expression or an XML schema document) over $\Sigma$; by $[\tau]$, the
language defined by $\tau$.

2.1.1. XML Documents 

An XML document can be viewed, from a structural point of view, as a finite
ordered, unranked tree (hereafter just a tree) $t$ with nodes labeled over a given
alphabet $\Sigma$. The topmost node in $t$ is denoted by $root(t)$, while for any node $x$
of $t$, we denote by

• $parent_t(x)$ the (unique) parent node of $x$ (if node $x$ is not the root);

• $children_t(x)$ the sequence of children (possibly empty) of $x$ in left-to-right
order;

• $tree_t(x)$ the subtree of $t$ rooted at $x$;

• $lab_t(x) \in \Sigma$ the label of $x$;

• $anc\text{-}str_t(x) \in \Sigma^+$ the sequence of labels of the path from the root of $t$ to
$x$;

• $child\text{-}str_t(x) \in \Sigma^*$ the labels of the children of $x$ in left-to-right order.

In particular, if $child\text{-}str_t(x) = \varepsilon$, then $x$ is called a leaf node. The size of $t$
is the number of its nodes. Also in these predicates we may
omit the subscript $t$ when it is clear from the context.

2.1.2. Regular String Languages 

A nondeterministic finite state machine (nFA) over $\Sigma$ is a quintuple $A =
(K, \Sigma, \Delta, q_s, F)$ where $K$ are the states, $q_s \in K$ is the initial state, $F \subseteq K$ are
the final states, and $\Delta \subseteq K \times (\Sigma \cup \{\varepsilon\}) \times K$ is the transition relation. Each triple
$(q, a, q') \in \Delta$ is called a transition of $A$. Sometimes the notation $q' \in \Delta(q, a)$,
where $\Delta$ is seen as a function from $K \times (\Sigma \cup \{\varepsilon\})$ to $2^K$, is more convenient.
By $\Delta^* \subseteq K \times \Sigma^* \times K$ we denote the extended transition relation defined as the
reflexive-transitive closure of $\Delta$, in such a way that $(q, w, q') \in \Delta^*$ iff there is
a sequence of transitions from $q$ to $q'$ recognizing string $w$. The set of strings
$[A] = \{w \in \Sigma^* : \Delta^*(q_s, w) \cap F \neq \emptyset\}$ is the language defined by $A$. Such machines
can be combined in various ways (see [22] for a comprehensive analysis). In
particular, $\overline{A}$ denotes the complement of $A$, and defines the language $\Sigma^* - [A]$.
Given two nFAs $A_1$ and $A_2$, we denote by $A_1 \cdot A_2$, $A_1 \cup A_2$, $A_1 \cap A_2$, and $A_1 - A_2$
the nFA defining $[A_1] \circ [A_2]$, $[A_1] \cup [A_2]$, $[A_1] \cap [A_2]$, and $[A_1] - [A_2]$, respectively
(operators $\cdot$ and $\circ$ are often omitted). Also, for a set $\mathcal{A} = \{A_1, \ldots, A_m\}$ of nFAs,
we often write $\bigcap \mathcal{A}$ (or $\bigcup \mathcal{A}$) instead of $A_1 \cap \ldots \cap A_m$ (or $A_1 \cup \ldots \cup A_m$).

A deterministic finite automaton (dFA) over $\Sigma$ is an nFA where $\Delta$ is a function
from $K \times \Sigma$ to $K$.
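
The acceptance condition of an nFA and the dFA restriction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of the paper: the transition relation is encoded as a dictionary from `(state, symbol)` pairs to sets of states, the empty string `""` plays the role of $\varepsilon$, and the function names are hypothetical.

```python
def eps_closure(states, delta):
    """All states reachable from `states` through epsilon-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for q2 in delta.get((q, ""), ()):
            if q2 not in seen:
                seen.add(q2)
                stack.append(q2)
    return seen

def accepts(nfa, word):
    """True iff some run of the nFA on `word` ends in a final state."""
    K, sigma, delta, q_s, F = nfa
    current = eps_closure({q_s}, delta)
    for a in word:
        step = set()
        for q in current:
            step |= set(delta.get((q, a), ()))
        current = eps_closure(step, delta)
    return bool(current & F)

def is_dfa(nfa):
    """True iff delta is a total function from K x Sigma to K (no epsilon moves)."""
    K, sigma, delta, q_s, F = nfa
    if any(symbol == "" for (_, symbol) in delta):
        return False
    return all(len(delta.get((q, a), ())) == 1 for q in K for a in sigma)

# nFA for (a+b)*b: it guesses when the last b is read.
nfa = ({0, 1}, {"a", "b"}, {(0, "a"): {0}, (0, "b"): {0, 1}}, 0, {1})
print(accepts(nfa, "ab"), accepts(nfa, "ba"), is_dfa(nfa))
```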



A (possibly nondeterministic) regular expression (nRE or also regex, for short)
$r$ over $\Sigma$ is generated by the following abstract syntax:

$r ::= \varepsilon \mid \emptyset \mid a \mid (r \cdot r) \mid (r + r) \mid r? \mid r^{+} \mid r^{*}$

where $a$ stands generically for the elements of $\Sigma$. When it is clear from the
context, we avoid unnecessary brackets or the use of $\cdot$ for concatenation. The
language $[r]$ is defined as usual.

A deterministic regular expression (dRE) $r$ is an nRE with the following
restriction. Let us consider the regex $\bar{r}$ built from $r$ by replacing each symbol
$a \in \Sigma$ with $a^i$ where $i$ is the position from left-to-right of $a$ in $r$. By definition,
$r$ is a dRE if there are no strings $w a^i u$ and $w a^j v$ in $[\bar{r}]$ such that $i \neq j$. The
language $[r]$ of a dRE $r$ is called one-unambiguous [11].
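
The dRE condition can be tested without enumerating strings. The sketch below relies on the standard characterization (swapped in here, not taken from this paper) that an expression is deterministic exactly when its Glushkov position automaton is deterministic. It assumes expressions are given as nested tuples such as `("cat", r1, r2)`, `("alt", r1, r2)`, `("opt", r)`, `("plus", r)`, `("star", r)`, `("sym", a)` and `("eps",)`, and that $\emptyset$ does not occur; the encoding and function names are hypothetical.

```python
def glushkov(r):
    """Mark positions left to right; return (first, follow, sym)."""
    sym = {}     # position -> symbol
    follow = {}  # position -> set of positions that may come next

    def walk(e):
        # Returns (nullable, first positions, last positions) of e.
        kind = e[0]
        if kind == "eps":
            return True, set(), set()
        if kind == "sym":
            i = len(sym) + 1
            sym[i] = e[1]
            follow[i] = set()
            return False, {i}, {i}
        if kind == "cat":
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            for p in l1:
                follow[p] |= f2
            return n1 and n2, f1 | (f2 if n1 else set()), l2 | (l1 if n2 else set())
        if kind == "alt":
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            return n1 or n2, f1 | f2, l1 | l2
        # Unary operators: opt, plus, star.
        n, f, l = walk(e[1])
        if kind in ("plus", "star"):
            for p in l:
                follow[p] |= f
        return n or kind in ("opt", "star"), f, l

    _, first, _ = walk(r)
    return first, follow, sym

def distinct_symbols(positions, sym):
    labels = [sym[p] for p in positions]
    return len(labels) == len(set(labels))

def is_deterministic(r):
    """True iff no two competing positions carry the same symbol."""
    first, follow, sym = glushkov(r)
    return distinct_symbols(first, sym) and all(
        distinct_symbols(s, sym) for s in follow.values())

a, b = ("sym", "a"), ("sym", "b")
r1 = ("cat", ("star", ("alt", a, b)), a)                            # (a+b)*a
r2 = ("cat", ("cat", ("star", b), a), ("star", ("cat", ("star", b), a)))  # b*a(b*a)*
print(is_deterministic(r1), is_deterministic(r2))
```

The two expressions define the same language, so the example also shows that determinism is a property of the expression, while one-unambiguity is a property of the language.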



A cartesian product of $n$ finite sets is called a "box" [41]. More precisely, fix
a positive number $n$. Let $\Sigma$ be an alphabet. A box $B$ over $\Sigma$ is any language of
the form $\Sigma_1 \cdots \Sigma_n$ where $n$ is its width, and $\Sigma_i \subseteq \Sigma$ for each $i$ in $[1..n]$. Clearly,
each box is a regular language as it is a finite one.
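
As a small illustration of the definition, the Python sketch below builds a box from its components and tests whether a finite language of single-character symbols is a box, using the observation that a set of equal-length words is a box exactly when it equals the product of its column projections. The function names are hypothetical, and the empty language is left out of scope (the test simply returns False for it).

```python
from itertools import product

def box(*components):
    """The language Sigma_1 ... Sigma_n for the given sets of symbols."""
    return {"".join(w) for w in product(*components)}

def is_box(language):
    """True iff the (non-empty) language equals the product of its columns."""
    words = list(language)
    if len({len(w) for w in words}) != 1:
        return False  # words of different lengths, or no word at all
    columns = [set(col) for col in zip(*words)]
    return box(*columns) == set(words)

B = box({"a", "b"}, {"c"}, {"a", "d"})
print(sorted(B), is_box(B), is_box({"ab", "ba"}))
```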



2.1.3. Regular Tree Languages 

A nondeterministic Unranked Tree Automaton (nUTA) is a quadruple $A =
(K, \Sigma, \Delta, F)$ where $\Sigma$ is the alphabet, $K$ is a finite set of states; $F \subseteq K$ is the
set of final states; $\Delta$ is a function mapping pairs from $(K \times \Sigma)$ to nFAs over
$K$. A tree $t$ belongs to $[A]$ if and only if there is a mapping $\mu$ from the nodes
of $t$ to $K$ such that (i) $\mu(root(t)) \in F$, and (ii) for each node $x$ of $t$, either $\varepsilon$ or
$\mu(children(x))$ belongs to $[\Delta(\mu(x), lab(x))]$ according to whether $x$ is a leaf-node
or not, respectively.

A bottom-up-deterministic Unranked Tree Automaton (dUTA) over $\Sigma$ is an
nUTA where $\Delta$ is a function from $(K \times \Sigma)$ to dFAs over $K$ in such a way that
$[\Delta(q, a)] \cap [\Delta(q', a)] = \emptyset$ for each $q \neq q'$.

2.1.4. Known decision problems 

In this section we recall some well known decision problems. 

Definition 1. EQUIV[$\mathcal{S}$] is the following decision problem. Given two $\mathcal{S}$-types,
do they define the same language? □

In particular, whenever we consider two $\mathcal{R}$-types instead of $\mathcal{S}$-types, we still
denote by EQUIV[$\mathcal{R}$] the equivalence problem defined exactly as above.

Definition 2. ONE-UNAMB[$\mathcal{R}$] is the following decision problem. Given a
regular language $L$ specified by an $\mathcal{R}$-type, is $L$ one-unambiguous? □

2.2. Types

As already mentioned, we consider abstractions of the most common XML
Schemas by allowing regular languages, specified by possibly different formalisms
for defining content models. More formally, let $\mathcal{R}$ be a mechanism for describing
regular languages (nFAs, dFAs, nREs, dREs, or even others). We want to define
and computationally characterize the problems regarding distributed XML
design in a comparative analysis among the three main current formalisms for
specifying XML schema documents: W3C DTDs, W3C XSD and Regular Tree
Grammars (like Relax-NG). For each of these schema languages, we adopt a
class of abstractions that we call $\mathcal{R}$-DTDs, $\mathcal{R}$-SDTDs, and $\mathcal{R}$-EDTDs, respectively,
where $\mathcal{R}$ is the particular mechanism for defining content models. We show that
a number of properties do not depend on the choice of $\mathcal{R}$ (or even of $\mathcal{S}$) and for
some complexity results we focus our analysis on the case of nFAs. Before that,
we summarize in Table 1 the relevance of the different tree grammars.



Table 1: Comparison between our abstractions of XML Schemas and existing formalisms.

Schema language   Previously introduced formalism                   Our abstraction
---------------   -----------------------------------------------   ---------------
W3C DTDs          DTDs and ltds                                     dRE-DTDs
W3C XSD           Single-Type Tree Grammars and single-type EDTDs   dRE-SDTDs
Relax NG          unranked regular tree languages                   nRE-EDTDs
                  (specialized ltds and EDTDs)



2.2.1. $\mathcal{R}$-DTD types

The following definition generalizes definitions considered in the literature
such as ltds [1, [111 or DTDs 2^ S^l, which were defined for analyzing the properties
of W3C Document Type Definitions. As we marry these views, we define the
following class of abstractions capturing all of them.

Definition 3. An $\mathcal{R}$-DTD is formalized as a triple $\tau = (\Sigma, \pi, s)$ where

• $\Sigma$ is an alphabet (the element names);

• $\pi$ is a function mapping the symbols of $\Sigma$ to $\mathcal{R}$-types still over $\Sigma$;

• $s \in \Sigma$ is the start symbol.

A tree $t$, having labels over $\Sigma$, belongs to $[\tau]$ if and only if: $lab(root(t)) = s$ and
$child\text{-}str(x) \in [\pi(lab(x))]$, for each node $x$ of $t$. For a given element name $a$, the
regular language $[\pi(a)]$, associated to $a$, is usually called the content model
of $a$. □
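
Membership in $[\tau]$ as in Definition 3 can be sketched with Python's `re` module standing in for the formalism $\mathcal{R}$. The encoding is an assumption made for illustration only: trees are `(label, children)` pairs, each child label is written followed by `;` so that a content model becomes an ordinary regex over the resulting string, and element names without a rule are treated as leaves. The helper names are hypothetical; the rules are those of the DTD of Figure 3.

```python
import re

def compile_model(expr):
    """Turn 'a, (b | c)*' into a regex over ';'-terminated element names."""
    body = re.sub(r"[A-Za-z]+", lambda m: "(?:" + m.group(0) + ";)", expr)
    return re.compile(body.replace(",", "").replace(" ", ""))

def valid(tree, pi, start):
    """True iff the root is labeled `start` and every node satisfies its model."""
    def ok(node):
        label, kids = node
        model = pi.get(label)
        if model is None:          # no rule: the node must be a leaf
            return not kids
        word = "".join(k[0] + ";" for k in kids)
        return model.fullmatch(word) is not None and all(ok(k) for k in kids)
    return tree[0] == start and ok(tree)

pi = {name: compile_model(expr) for name, expr in {
    "eurostat": "averages, nationalIndex*",
    "averages": "(Good, index+)+",
    "nationalIndex": "country, Good, (index | (value, year))",
    "index": "value, year",
}.items()}

leaf = lambda name: (name, [])
index = ("index", [leaf("value"), leaf("year")])
doc = ("eurostat", [
    ("averages", [leaf("Good"), index]),
    ("nationalIndex", [leaf("country"), leaf("Good"), index]),
    ("nationalIndex", [leaf("country"), leaf("Good"), leaf("value"), leaf("year")]),
])
bad = ("eurostat", [("averages", [leaf("Good"), index]),
                    ("nationalIndex", [leaf("country"), leaf("Good")])])
print(valid(doc, pi, "eurostat"), valid(bad, pi, "eurostat"))
```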
Notice that, due to the above definition, $\mathcal{R}$-DTDs with useless element names,
or even defining the empty language, do exist. This is because the above
definition allows one to specify $\mathcal{R}$-DTDs that are, in a sense, "not reduced" (think about
finite automata with unreachable states). Since it is much more convenient to
deal with types that are not affected by these drawbacks, after giving some more
definitions, we formalize the notion of reduced types.

We introduce the dFA $\mathit{dual}(\tau)$. It defines the language consisting of the set of
paths from the root to a leaf in trees in $[\tau]$ and it is in some sense the vertical
language of $\tau$.

Definition 4. Let $\tau = (\Sigma, \pi, s)$ be an $\mathcal{R}$-DTD. We build from $\tau$ the dual dFA
$\mathit{dual}(\tau) = (K, \Sigma, \delta, q_0, F)$ as follows:

• $K = \{q_0\} \cup \{q_a : a \in \Sigma\}$;

• $\delta(q_0, s) = q_s$;

• for each $a, b \in \Sigma$, $\delta(q_a, b) = q_b$ iff $b$ appears in the alphabet of $\pi(a)$;

• $q_a \in F$ iff $\varepsilon \in [\pi(a)]$. □

Before defining a set of conditions ensuring that all the content models of a
given $\mathcal{R}$-DTD $\tau$ are well defined and have no redundancy w.r.t. the language $[\tau]$,
we mark the states of $\mathit{dual}(\tau)$ (in a bottom-up style) as follows:

1. Mark each final state of $\mathit{dual}(\tau)$ as bound;

2. For each non-bound state $q_b$, consider the set $\Sigma_b \subseteq \Sigma$ where $\delta(q_b, a)$ is
bound iff $a \in \Sigma_b$. If $[\pi(b)] \cap \Sigma_b^{*} \neq \emptyset$, then mark also $q_b$ as bound;

3. Repeat step 2 until no more states can be marked.

Definition 5. Let $\tau$ be an $\mathcal{R}$-DTD. We say that $\tau$ is reduced iff

• Each state of $\mathit{dual}(\tau)$ is in at least a path from $q_0$ to a final state in $F$;

• Each state of $\mathit{dual}(\tau)$ is bound;

• $[\mathit{dual}(\tau)]$ is nonempty. □

We consider only reduced $\mathcal{R}$-DTDs where, by the previous definition, it is clear
that $[\tau] \neq \emptyset$. Note that for a given $\mathcal{R}$-DTD $\tau$, it is very easy to build $\mathit{dual}(\tau)$ and
for each "unprofitable" state $q_a$

• remove the element name $a$ from $\Sigma$;

• remove the rule $\pi(a)$ from $\pi$;

• modify the rules containing $a$ in their content models (using standard regular
language manipulation) to produce only words not containing $a$ (see [28] for
more details).

Finally, we notice that only the last step of the reducing algorithm may depend
on the choice of $\mathcal{R}$. Clearly, an $\mathcal{R}$-DTD and its reduced version describe the same
language.

From a theoretical point of view, $\mathcal{R}$-DTDs do not express more than the local
tree languages [s^. In particular, nFA-DTDs, dFA-DTDs and nRE-DTDs exactly
capture this class of languages while dRE-DTDs are less expressive [33, 35].
Nevertheless, the last class of types (using deterministic regular expressions [11] and
that does not capture all the local tree languages) is, from a structural point of
view, the closest to W3C DTDs.

In this paper, for a given $\mathcal{R}$-DTD where $\mathcal{R}$ stands for dFAs or nFAs (for
shortness, w.l.o.g., and only in examples) we often specify $\pi$ as a function that
maps $\Sigma$-symbols to $\Sigma$-nREs (recall that any regular expression of size $n$ can be
transformed into an equivalent $\varepsilon$-free nFA with $O(n \log^2 n)$ transitions in time
$O(n \log^2 n)$ [13, 111]).

Finally, an example of dRE-DTD is $\tau_1 = (\{s_1, c\}, \pi_1, s_1)$ with $\pi_1(s_1) = c^*$ and
$\pi_1(c) = \varepsilon$. In the rest of the paper, we often omit to specify rules such as
$\pi_1(c) = \varepsilon$; i.e., if no rule is given for a label, nodes with this label are assumed
to be (solely) leaves.
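
The construction of the dual automaton (Definition 4), the marking procedure, and the reducedness test (Definition 5) can be sketched as follows. This is a hedged illustration under several assumptions: content models are nested tuples (`("sym", a)`, `("eps",)`, `("cat", r1, r2)`, `("alt", r1, r2)`, `("opt", r)`, `("plus", r)`, `("star", r)`, with no $\emptyset$), every element name has an explicit rule, the state $q_a$ is represented by the name `a` itself, and the initial state by the string `"q0"` (so no element may be called `q0`). The conditions of Definition 5 are applied to the element states only. All function names are hypothetical.

```python
def symbols(e):
    """Alphabet of a content model."""
    if e[0] == "sym":
        return {e[1]}
    return set().union(*(symbols(c) for c in e[1:]))

def restricted_nonempty(e, allowed):
    """True iff the model generates some word using only `allowed` symbols."""
    kind = e[0]
    if kind == "eps" or kind in ("opt", "star"):
        return True
    if kind == "sym":
        return e[1] in allowed
    if kind == "cat":
        return all(restricted_nonempty(c, allowed) for c in e[1:])
    if kind == "alt":
        return any(restricted_nonempty(c, allowed) for c in e[1:])
    return restricted_nonempty(e[1], allowed)  # plus

def dual(pi, start):
    """Transitions and final states of the dual dFA."""
    delta = {("q0", start): start}
    for a, model in pi.items():
        for b in symbols(model):
            delta[(a, b)] = b
    # q_a is final iff the empty word belongs to the content model of a.
    final = {a for a, model in pi.items() if restricted_nonempty(model, set())}
    return delta, final

def is_reduced(pi, start):
    delta, final = dual(pi, start)
    # Bottom-up marking of bound states (steps 1 to 3).
    bound, changed = set(final), True
    while changed:
        changed = False
        for b, model in pi.items():
            if b not in bound and restricted_nonempty(model, symbols(model) & bound):
                bound.add(b)
                changed = True
    # States reachable from q0.
    reach, stack = set(), [start]
    while stack:
        a = stack.pop()
        if a not in reach:
            reach.add(a)
            stack.extend(symbols(pi[a]))
    # States from which a final state can be reached.
    useful, changed = set(final), True
    while changed:
        changed = False
        for a, model in pi.items():
            if a not in useful and symbols(model) & useful:
                useful.add(a)
                changed = True
    states = set(pi)
    return bound == states and reach == states and useful == states

tau1 = {"s1": ("star", ("sym", "c")), "c": ("eps",)}
not_reduced = dict(tau1, u=("sym", "u"))  # u is unreachable and never bound
print(is_reduced(tau1, "s1"), is_reduced(not_reduced, "s1"))
```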

2.2.2. $\mathcal{R}$-SDTD types

The following definition generalizes definitions considered in the literature
such as Single-Type Tree Grammars [33] or single-type EDTDs [29], which were defined
for analyzing the properties of W3C XML Schema Definitions. Also here, we
define a class of abstractions capturing all of them.

Definition 6. An $\mathcal{R}$-SDTD (standing for single-type extended $\mathcal{R}$-DTD) is a
quintuple $\tau = (\Sigma, \tilde{\Sigma}, \pi, s, \mu)$ where

• $\tilde{\Sigma}$ are the specialized element names;

• $(\tilde{\Sigma}, \pi, s)$ is an $\mathcal{R}$-DTD on $\tilde{\Sigma}$ and denoted by $dtd(\tau)$;

• $\mu : \tilde{\Sigma} \to \Sigma$ is a mapping from all the specialized element names onto the
set of element names. For each $a \in \Sigma$, we denote by $a^1, \ldots, a^n$ the distinct
elements in $\tilde{\Sigma}$ that are mapped to $a$. This set is denoted $\tilde{\Sigma}(a)$;

• Let $\mathit{dual}(dtd(\tau))$ be $(K, \tilde{\Sigma}, \tilde{\delta}, q_0, F)$. Build from this dFA the possibly nFA
$\mathit{dual}(\tau) = (K, \Sigma, \delta, q_0, F)$ where for each $q, q' \in K$ and $a \in \Sigma$, $q' \in \delta(q, a)$ iff
there is an element $\tilde{a} \in \tilde{\Sigma}(a)$ such that $\tilde{\delta}(q, \tilde{a}) = q'$. We require that $\mathit{dual}(\tau)$ is
a dFA (this captures the single-type requirement). Also in this case, $\mathit{dual}(\tau)$
defines the vertical language of $\tau$.

A tree $t$, labeled over $\Sigma$, is in $[\tau]$ if and only if there exists a tree $t' \in [dtd(\tau)]$
such that $t = \mu(t')$ (where $\mu$ is extended to trees). Informally, we call $t'$ a
witness for $t$. Finally, an $\mathcal{R}$-SDTD $\tau$ is reduced if and only if $dtd(\tau)$ is. □

As for $\mathcal{R}$-DTDs, we consider only reduced $\mathcal{R}$-SDTDs.

From a theoretical point of view, $\mathcal{R}$-SDTDs are more expressive than $\mathcal{R}$-DTDs
but do not yet capture the unranked regular tree languages.



2.2.3. $\mathcal{R}$-EDTD types

The following definition generalizes definitions considered in the literature
such as specialized ltds @,[3^ or EDTDs [2^. Such formalisms (like Relax-NG),
from a structural perspective, express exactly the homogeneous unranked regular
tree languages and are as expressive as unranked tree automata or Regular Tree
Grammars [10].
Definition 7. An $\mathcal{R}$-EDTD (extended $\mathcal{R}$-DTD) $\tau$ is an $\mathcal{R}$-SDTD without the
single-type requirement. More formally, the automaton $\mathit{dual}(\tau)$, built as for $\mathcal{R}$-SDTDs,
may be here an nFA. The language $[\tau]$ is defined as for $\mathcal{R}$-SDTDs. □
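
The single-type requirement that separates Definition 6 from Definition 7 amounts to checking that $\mathit{dual}(\tau)$ is deterministic, i.e., that no content model mentions two specialized names mapped to the same element name. The Python sketch below illustrates this on the type $\tau''$ of Figure 6. It assumes that each content model is summarized by its alphabet only and that $\mu$ is given as a dictionary; both the encoding and the function name are hypothetical.

```python
def single_type_violations(alphabets, mu):
    """List (parent, x, y) where x != y occur in parent's model and mu(x) = mu(y)."""
    violations = []
    for parent, alphabet in alphabets.items():
        seen = {}  # element name -> first specialized name met in this model
        for child in sorted(alphabet):
            element = mu[child]
            if element in seen:
                violations.append((parent, seen[element], child))
            else:
                seen[element] = child
    return violations

# Alphabets of the content models of the type of Figure 6.
alphabets = {
    "eurostat": {"averages", "natIndA", "natIndB"},
    "averages": {"Good", "index"},
    "natIndA": {"country", "Good", "index"},
    "natIndB": {"country", "Good", "value", "year"},
    "index": {"value", "year"},
}
# natIndA and natIndB specialize nationalIndex; other names map to themselves.
mu = {name: name for names in alphabets.values() for name in names}
mu.update({"eurostat": "eurostat", "natIndA": "nationalIndex", "natIndB": "nationalIndex"})

print(single_type_violations(alphabets, mu))
```

Under this encoding, the only violation is reported for the content model of eurostat, so the type of Figure 6 would be an EDTD that is not single-type.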

2.3. Distributed Documents 

In the context of distributed architectures (e.g., P2P architectures),
distributed documents (or distributed trees), such as AXML documents, are XML
documents that may contain embedded function calls. In particular, a
distributed XML document $T[t_1..t_n]$ can be viewed as a collection of (classical)
XML documents $t_1..t_n$ brought together by a unique (special) XML document
$T[f_1,\ldots,f_n]$, the kernel, some of whose leaf-nodes, called function-nodes, play the
role of "docking points" for the external resources $f_1, \ldots, f_n$. The "activation"
of a node of $T$ having a function as label, say $f_i$, consists in a call to resource
(or function) $f_i$, the result of which is still an XML document, say $t_i$. When $f_i$
is invoked, its result is used to extend the kernel $T[f_1,\ldots,f_n]$. Thus, each docking
point connects the peer that holds the kernel and invokes the resource $f_i$, and
the peer that provides the corresponding XML document $t_i$. For simplicity of
notation, for labeling a function-node we use exactly the name of the resource
it refers to. For instance, the tree $T_0 = s(a\ f_1\ b(f_2))$ is a kernel having $s$ as root,
and containing two function-nodes referring to the external resources $f_1$ and $f_2$.

The extension ext_T(t_1..t_n) of T is the whole XML document (without any
function at all) obtained from the distributed document T[t_1..t_n] by
replacing each node referring to resource f_i with the forest of XML trees (in
left-to-right order) directly connected to the root of t_i. This process is
called materialization. For instance, the extension of kernel T_0 would be
s(a c(dd) b(d(ef))) in case resources f_1 and f_2 provided trees s_1(c(dd))
and s_2(d(ef)), respectively.
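The materialization step can be illustrated with a small sketch. It is not part of the formal development: the tuple-based tree encoding and the helper names `materialize` and `show` are assumptions made only for this illustration, and the results returned by the resources are assumed to contain no function-nodes themselves.

```python
from typing import Dict, List, Tuple

# A tree is a pair (label, children); a leaf has an empty child list.
Tree = Tuple[str, list]

def materialize(kernel: Tree, ext: Dict[str, Tree]) -> Tree:
    """Replace every function-node by the forest below the root of its result."""
    label, children = kernel
    new_children: List[Tree] = []
    for child in children:
        child_label = child[0]
        if child_label in ext:
            # Function-node: splice in the children of the root of t_i,
            # in left-to-right order (the root s_i itself is dropped).
            _root, forest = ext[child_label]
            new_children.extend(forest)
        else:
            new_children.append(materialize(child, ext))
    return (label, new_children)

def show(tree: Tree) -> str:
    """Print a tree in the term notation used in the text."""
    label, children = tree
    if not children:
        return label
    return label + "(" + " ".join(show(c) for c in children) + ")"

# Kernel T_0 = s(a f_1 b(f_2)); resources return s_1(c(d d)) and s_2(d(e f)).
T0 = ("s", [("a", []), ("f1", []), ("b", [("f2", [])])])
ext = {
    "f1": ("s1", [("c", [("d", []), ("d", [])])]),
    "f2": ("s2", [("d", [("e", []), ("f", [])])]),
}
print(show(materialize(T0, ext)))  # s(a c(d d) b(d(e f)))
```

Note that the roots s_1 and s_2 never appear in the extension: only the forests below them are attached to the kernel.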

An interesting task is to associate a type τ_i (e.g., a W3C XSD document)
to each resource f_i in such a way that the XML document t_i returned as
answer is valid w.r.t. this type and any materialization process always
produces a document ext_T(t_1..t_n) valid w.r.t. a given global type τ (still
specified by the W3C XSD syntax). A global type and a kernel document
represent the (top-down) design of a given distributed architecture. A
collection of types associated to the function calls in such a design is
called a typing. Given a distributed design, we would like to know whether a
given typing has some properties, or whether a typing with some properties
exists. Alternatively, we could start directly from a kernel T and a typing
(bottom-up design) and analyze the properties of the tree language consisting
of all possible extensions ext_T(t_1..t_n).

More formally, let Σ and Σ_f be two alphabets, respectively, of element names
(such as s, a, b, c, etc.) and function symbols (such as f, g, etc.). A kernel
document or kernel tree T[f_1..f_n] (or also T(f_n), with (f_n) denoting a
sequence (*) of length n) is a tree over (Σ ∪ Σ_f) where:

(i) the root is an element node (say s_0);
(ii) the function nodes f_1, ..., f_n are leaf nodes;
(iii) no function symbol occurs more than once.

(*) We denote a finite sequence of objects (x_1, ..., x_n) over an index set
I = {1, ..., n} by (x_n), and we often omit the specification of the index
set I.

In particular, for each non-leaf node of T, say x, the kernel string
child-str(x), with k ≥ 0 functions, is of the form
w_h f_{h+1} w_{h+1} ... f_{h+k} w_{h+k} (for some h in [0..n−k]), where
w_i ∈ Σ* for each i ∈ {h, ..., h+k}, f_i ∈ Σ_f for each i ∈ {h+1, ..., h+k},
and f_i ≠ f_j for each i ≠ j.

We next consider its semantics. It is defined by providing a tree for each
function-node. In particular, an extension ext maps each i in [1..n] to a tree
t_i = ext(f_i). The extension ext_T(t_1..t_n) of a kernel T[f_1..f_n] is
obtained by replacing each f_i with the forest of trees (in left-to-right
order) directly connected to the root of t_i.

A type τ for a kernel tree T is one of an R-DTD, R-SDTD, or R-EDTD. Given an
extension ext, we say that tree T[t_1..t_n] satisfies type τ if and only if
ext_T(t_1..t_n) does. This motivates requirement (iii), which avoids
irregularities. For instance, in the kernel T_1 = s(f f) the children of s in
any extension of T_1 are of the form ww for some word w. Since this is not a
regular language, the type of T_1 cannot be defined by any of the three
adopted formalisms. Although we disallow the same function to appear twice,
several functions may share the same type. Also, even if we label a
function-node with exactly the name of the resource it refers to (for
simplicity of notation), this does not prevent a resource from providing two
XML subtrees to be attached to the kernel. In fact, different names (function
symbols) can be associated to the same resource while still preserving
extensions from irregularities.

We introduce typings to constrain the types of the function calls of a kernel
document. A typing for a kernel tree T(f_n) is a positional mapping from the
functions in (f_n) to a sequence (τ_n) of types (schema documents). Since we
replace each f_i (in the extensions of T) with a forest of XML documents, for
each type τ_i associated to f_i we actually use a schema document containing
an "extra" element name, say s_i, which occurs only as the label of the root
in all the trees in [τ_i].

Definition 8. We denote by ext_T(τ_n) the tree language consisting of all
possible extensions ext_T(t_1..t_n) where t_i ⊨ τ_i (t_i is valid w.r.t. τ_i)
for each i. □

Definition 9. We denote by T(τ_n) the nFA-EDTD (or nRE-EDTD) constructed
from T and (τ_n) in the obvious way such that [T(τ_n)] = ext_T(τ_n). □

In Section 3.1 we will show precisely how to build T(τ_n) in polynomial time,
prove that the construction is semantically correct, and establish that the
size of T(τ_n) is linear in the size of T and (τ_n). Let us illustrate for now
the issues with an example. Observe, for instance, that for the tree
T = s_0(a(b) f_1 a(c)), no matter which type τ_1 is, there is no R-DTD-type
expressing the language ext_T(τ_1). Indeed, this is even the case for
T = s_0(a(b) a(c)) with no function at all. If we consider the tree
T = s_0(a(f_1) a(f_2)), then the typing [τ_1] = {s_1(b)}, [τ_2] = {s_2(c)}
prohibits that ext_T(τ_1, τ_2) is expressible by an R-DTD-type because
[T(τ_n)] = {s_0(a(b) a(c))}, entailing that the content model of b is
non-regular; while the typing [τ_1] = {s_1(b)}, [τ_2] = {s_2(b)} allows that,
because [T(τ_n)] = {s_0(a(b) a(b))}, entailing that all the content models of
s_0, a and b are regular languages, {aa}, {b} and {ε}, respectively. Such
situations motivated Definition 11.

Before concluding this section, we adapt the previous definitions to strings
in the straightforward way. (We will often use reductions to string problems
in the paper.) Let w(f_n) = w_0 f_1 w_1 ... f_n w_n be a kernel string. For
typing strings, we use R-types where R ∈ {nFA, dFA, nRE, dRE}. A typing for
w(f_n) is still a positional mapping from the functions in (f_n) to a sequence
(τ_n) of R-types. By ext_w(τ_n) we still denote the string language consisting
of all possible extensions of w, and by w(τ_n) the nFA (or nRE) constructed
from w and (τ_n) in such a way that [w(τ_n)] = ext_w(τ_n).
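For the nRE case, w(τ_n) amounts to substituting each function symbol by its type inside the kernel string. The sketch below illustrates this with Python regular expressions; it assumes single-character element names, and the function name `kernel_regex` and the sample data are hypothetical.

```python
import re
from typing import Dict, List

def kernel_regex(kernel: List[str], typing: Dict[str, str]) -> "re.Pattern[str]":
    """Build a regex for w(tau_n): element names stay literal, and each
    function symbol is replaced by the (parenthesized) regex of its type."""
    parts = []
    for symbol in kernel:
        if symbol in typing:
            parts.append("(?:" + typing[symbol] + ")")
        else:
            parts.append(re.escape(symbol))
    return re.compile("".join(parts))

# Kernel string w = a f1 c f2, with f1 typed by b* and f2 typed by d*.
w = ["a", "f1", "c", "f2"]
typing = {"f1": "b*", "f2": "d*"}
pattern = kernel_regex(w, typing)

print(pattern.pattern)                     # a(?:b*)c(?:d*)
print(bool(pattern.fullmatch("abbcd")))    # True: f1 -> bb, f2 -> d
print(bool(pattern.fullmatch("abdc")))     # False: not an extension of w
```

Membership in ext_w(τ_n) is then an ordinary regex match, which is why the size of w(τ_n) stays linear in the size of w and (τ_n) in this representation.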

We will use in our proofs a generalization to "boxes". A kernel box
B(f_n) = B_0 f_1 B_1 ... f_n B_n is, here, a finite regular language over
(Σ ∪ Σ_f) where f_1, ..., f_n are as above, and each B_i is a box (of a fixed
width) over Σ. With B(τ_n) we denote the nFA (or nRE) constructed from B and
(τ_n) in such a way that [B(τ_n)] = ext_B(τ_n).

2.4. The Typing Problems 

In this section, we introduce the notion of distributed XML design, define
the design problems that are central to the present work, and give an overview
of the complexity results. We consider two different approaches, bottom-up and
top-down, according to whether the distributed design, other than a kernel
tree, consists of a typing or a target type, respectively.

Definition 10. Let S be a schema language, and T[f_1..f_n] be a kernel
document. We call S-design (or just design) one of the following:

• D = ((τ_n), T[f_1..f_n]) where (τ_n) is an S-typing. This is bottom-up design.
• D = (τ, T[f_1..f_n]) where τ is a (target) S-type. This is top-down design.

□

Intuitively, given a bottom-up design, one would like to find a global type 
that captures the typing of the global document. On the other hand, given a 
top-down design, one would like to find types for the local documents that will 
guarantee the global type. 

With the following definition, we start the bottom-up analysis. Notice that 
the concepts used for bottom-up design will be also useful when we consider 
top-down design. 

Definition 11. Given an S-design D = ((τ_n), T), the S-typing (τ_n) is
S-consistent with T (simply consistent when S is understood) if there exists
an S-type τ such that [τ] = ext_T(τ_n), in other words, if ext_T(τ_n) is
definable by some S-type. This problem (deciding whether an S-typing is
S-consistent with a kernel tree) is called CONS[S]. □

We will denote by type_T^S(τ_n), or type_T(τ_n) when S is understood, the
S-type, when it exists, such that [type_T(τ_n)] = ext_T(τ_n). Notice that if
both S and T are fixed, then type_T(τ_n) plays the role of a function from the
set of all possible S-typings of length n to a set of certain S-types.
Depending on the answers of CONS[S] (where T is now fixed), such a function
might be defined always, never, or only for some S-typings. Finally, the
complexity of deciding CONS[S] or computing type_T(τ_n) (with an estimation,
w.r.t. T(τ_n), of its possible size) may vary considerably depending on S.

Table 2 summarizes the complexity results of CONS[S]. We vary S among
R-DTDs, R-SDTDs and R-EDTDs, for various kinds of R. In all cases but dRE, we
get tight bounds. For DTDs and SDTDs with dRE, we provide nonmatching lower
and upper bounds. The table also shows the size that type_T(τ_n) may have in
the worst case. Again this is given precisely for all cases but dRE. For DTDs
and SDTDs with dRE, we provide nonmatching bounds.

In the next sections, we systematically analyze the complexity of this
problem by varying S among R-DTDs, R-SDTDs and R-EDTDs, and we will consider
type_T(τ_n) for each of these schema languages. We next give an example to
illustrate some of the main concepts introduced.



Table 2: Complexity results of CONS[S] compared with the worst-case-optimal
size of type_T(τ_n) with respect to m = ||T(τ_n)||.

       R-DTDs                       R-SDTDs                      R-EDTDs
nFA    PSPACE-complete              PSPACE-complete              DTIME(O(1))
       Θ(m)                         Θ(m)                         Θ(m)
nRE    PSPACE-complete              PSPACE-complete              DTIME(O(1))
       Θ(m)                         Θ(m)                         Θ(m)
dFA    PSPACE-complete              PSPACE-complete              DTIME(O(1))
       Θ(2^m)                       Θ(2^m)                       Θ(m^2)
dRE    PSPACE-hard and in EXPTIME   PSPACE-hard and in EXPTIME   DTIME(O(1))
       Ω(2^m) and O(2^(2^m))        Ω(2^m) and O(2^(2^m))        Θ(m)



Example 1. Consider the kernel T = s_0(a f_1 c f_2) and the pair
τ_1 = ({s_1, b}, π_1, s_1) and τ_2 = ({s_2, d}, π_2, s_2) of dRE-DTD-types,
with π_1(s_1) = b* and π_2(s_2) = d*. The activation of both f_1 and f_2 may
return trees s_1(bb) and s_2(d), respectively. These trees can be plugged into
T producing the extension s_0(abbcd). The tree language obtained by
considering each possible extension of T is
ext_T(τ_1, τ_2) = {s_0(a b^n c d^m) : n, m ≥ 0}. Now, we have:

type_T(τ_1, τ_2) = ({s_0, a, b, c, d}, π, s_0)

where π(s_0) = a b* c d* and all the element names other than s_0 are leaves.
Finally, (τ_1, τ_2) is dRE-DTD-consistent with T. □



We now define the top-down design problems. But before, we introduce
some straightforward notation. Let τ and τ' be two types. We say that:

• τ ≡ τ' (equivalent) iff [τ] = [τ']
• τ ≤ τ' (smaller or equivalent) iff [τ] ⊆ [τ']
• τ < τ' (smaller) iff [τ] ⊂ [τ']

and also that, given two typings (τ_n) and (τ'_n):

• (τ_n) ≡ (τ'_n) iff τ_i ≡ τ'_i for each i
• (τ_n) ≤ (τ'_n) iff τ_i ≤ τ'_i for each i
• (τ_n) < (τ'_n) iff (τ_n) ≤ (τ'_n) and τ_i < τ'_i for some i

Definition 12. Given an S-design D = (τ, T), we say that a typing (τ_n) is:

• sound if ext_T(τ_n) ⊆ [τ];
• maximal if it is sound, and there is no other sound typing (τ'_n) s.t.
  (τ_n) < (τ'_n);
• complete if ext_T(τ_n) ⊇ [τ];
• local if ext_T(τ_n) = [τ], namely if it is both sound and complete;
• perfect if it is local, and (τ'_n) ≤ (τ_n) for each other sound typing (τ'_n);
• D-consistent if it is an S-typing which is S-consistent as well. □
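For string designs, soundness, completeness and locality can be explored experimentally. The sketch below is only a bounded test, not a decision procedure: it compares ext_w(τ_n) with [τ] on all words up to a fixed length, so a positive answer is merely evidence. The names `words` and `check_typing`, the length bound, and the sample design (target a*bc* with kernel string f1 f2) are assumptions chosen for illustration.

```python
import re
from itertools import product

def words(alphabet, max_len):
    """Enumerate every word over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def check_typing(target, kernel, typing, alphabet, max_len=6):
    """Bounded comparison of ext_w(tau_n) with [tau] (evidence, not a proof)."""
    # Regex for the extensions: substitute each function by its type.
    ext = re.compile("".join(
        "(?:" + typing[s] + ")" if s in typing else re.escape(s)
        for s in kernel))
    tgt = re.compile(target)
    sound = complete = True
    for w in words(alphabet, max_len):
        in_ext = ext.fullmatch(w) is not None
        in_tgt = tgt.fullmatch(w) is not None
        if in_ext and not in_tgt:
            sound = False       # some extension violates the target type
        if in_tgt and not in_ext:
            complete = False    # some target word cannot be produced
    return {"sound": sound, "complete": complete, "local": sound and complete}

kernel = ["f1", "f2"]
print(check_typing("a*bc*", kernel, {"f1": "a*bc*", "f2": "c*"}, "abc"))
print(check_typing("a*bc*", kernel, {"f1": "a", "f2": "bc*"}, "abc"))
print(check_typing("a*bc*", kernel, {"f1": "a*", "f2": "b*c*"}, "abc"))
```

On this design the first typing passes both tests up to the bound, the second is sound but misses the word b, and the third produces every target word but also the empty word, which is outside a*bc*.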

Remark 1. It should be clear that for a given S-design D = (τ, T) we could
have sound typings that are not D-consistent. Note, moreover, that it is even
possible to have a sound typing where T(τ_n) does not define a regular tree
language. Consider the design D where T = s_0(f_1) and τ = s_0(a*b*).
Clearly, the typing [τ_1] = {s_1(a^n b^n) : n ≥ 0} is sound but [T(τ_1)] is
not regular. Anyway, we prove later in the paper (for strings, but the results
generalize to trees due to our reductions) that if an S-design admits a sound
typing (τ_n), then it also admits a sound nFA-EDTD-typing (τ'_n) such that
(τ_n) ≤ (τ'_n).

Also, by definition of maximality, note that, for instance, for a given
dRE-DTD-design D, a dRE-DTD-typing (τ_n) is not maximal even if there is a
sound nFA-DTD-typing (τ'_n) for D such that (τ_n) < (τ'_n). One could object
to such a definition. Anyway, Martens et al. [31] proved that whenever the
illustrated situation happens, there is also a dRE-DTD-typing (τ''_n) such
that (τ_n) < (τ''_n). □

Clearly, local typings present the advantage of allowing a local verification
of document consistency (soundness and completeness by definition). Also, no
consistent document is ruled out (completeness). Maximal locality guarantees
that, in some sense, no unnecessary constraints are imposed on the
participants. Finally, perfect typings are somehow the ultimate one can expect
in terms of not imposing constraints on the participants. Many designs will
not accept a perfect typing. However, there are maximal sound typings which
are not local. This is not surprising as there are designs that have at least
one sound typing but do not allow any local typing at all, and clearly, if
there is a sound typing, then there must also exist a maximal sound one. We
will see examples that separate these different classes further. But before,
we make an observation on D-consistency and formally state the problems
studied in the paper.

Let S be any schema language among R-DTDs, R-SDTDs, and R-EDTDs, where
R ∈ {nFA, dFA, nRE, dRE}. Whenever we consider a top-down S-design
D = (τ, T), we require that a typing (τ_n) for D be D-consistent, namely that
T(τ_n) is S-consistent (it has an equivalent S-type) and each τ_i is an
S-type. In order to verify such a condition, we can exploit the techniques
that we have developed for bottom-up design. In particular, it is not hard to
see that if (τ_n) is not S-consistent, then it cannot be local. Thus, our
approach aims at isolating problems concerning locality from those concerning
consistency.

Definition 13. LOC[S], ML[S], PERF[S] are the following decision problems.
Given an S-design D = (τ, T) and a D-consistent typing (τ_n), is (τ_n) a
local, or maximal local, or perfect typing for D, respectively? □

Definition 14. ∃-LOC[S], ∃-ML[S], ∃-PERF[S] are the following decision
problems. Given an S-design (τ, T), does there exist a local, or maximal
local, or perfect D-consistent typing for this design, respectively? □

We similarly define the corresponding word problems (S is simply R). We
have LOC[R], ML[R], PERF[R], ∃-LOC[R], ∃-ML[R] and ∃-PERF[R]. Finally, we will
use in proofs box versions of the problems ∃-LOC, ∃-ML and ∃-PERF.

Remark 2. In this paper, although we analyze all the three defined schema
languages (R-DTDs, R-SDTDs, and R-EDTDs) for top-down designs, after providing
reductions from trees to strings, we specialize the analysis to the case of
R = nFA. More tractable problems may be obtained by considering deterministic
content models or restricted classes of regular expressions [18, 27], as done
by Martens et al. [31]. Also, notice that we pay more attention to maximal
locality than to maximality proper. In fact, for the latter notion, the
existence problem is trivial. Moreover, the complexity of the verification
problem essentially coincides for both notions. Nevertheless, one could be
interested in a maximal sound typing when, for some reason, the design cannot
be improved and does not admit any local typing. There could even be cases
where a local typing does not exist but there is a unique maximal sound typing
comprising any other possible sound typing, a sort of quasi-perfect typing.
For instance, the design T = s(a f_1) and τ = s(ab* + d) has such a property.
Our techniques can be easily adapted to these cases, too. □

Table 3 gives an overview of the complexity results for the typing problems
previously defined. We will see later that, for R-DTDs and R-SDTDs, each
problem on trees is logspace-reducible to a set of problems on strings (thus,
it suffices to prove the results in Table 3 for words) and that, for R-EDTDs,
the problems on trees depend on the problems on boxes in a more complex
manner. In particular, row D includes two problems that are actually the same
(they only differ if R = dRE, as shown in Martens et al. [31]). Each number in
brackets refers either to the corresponding statement/proof in the paper (if
rounded) or to the paper where the particular result has already been proved
(if squared).

Table 3: Complexity results in case of top-down design

                  [1] nFAs / nFA-DTDs / nFA-SDTDs   [2] nFA-EDTDs
[A] LOC           PSPACE-complete                   EXPTIME-complete
[B] ML            PSPACE-hard, in PSPACE [31]       EXPTIME-hard, in 2-EXPSPACE
[C] PERF          PSPACE-complete                   EXPTIME-hard, in coNEXPTIME
[D] ∃-LOC/∃-ML    PSPACE-hard, in EXPSPACE [31]     EXPTIME-hard, in 2-EXPSPACE
[E] ∃-PERF        PSPACE-complete                   EXPTIME-hard, in coNEXPTIME



We now present examples that separate the different design properties of 
typings. 

Example 2. Let τ = ({s, a, b, c}, π, s) be an nRE-DTD where π(s) = a*bc*, and
T = s(f_1 f_2) be a kernel tree. It is easy to see that both s_1(a*bc*),
s_2(c*) and s_1(a*), s_2(a*bc*) are local typings as
a*bc*c* = a*a*bc* = a*bc*. In fact, they are also maximal local typings, and
so there is no perfect typing for this design. Observe that, for instance,
s_1(a?), s_2(a*bc*) is still a local typing that, however, is not maximal
because it imposes unnecessary constraints on the local sites. If desired, one
could leave them more freedom, e.g., type the first function with a*. □

Example 3. Let τ = s(a*bc*) be a type and T = s(f_1 b f_2) be a kernel tree.
The typing s_1(a*), s_2(c*) is perfect. This has to be an excellent typing
since there is no alternative maximal local typing. □

Example 4. Let τ = (ab)* be a type and T = s(f_1 f_2) be a kernel tree. The
typing s_1((ab)*), s_2((ab)*) is a unique maximal local typing but it is not
perfect. Consider, in fact, the typing s_1(a), s_2(b). It is sound but
(a, b) ≤ ((ab)*, (ab)*) does not hold. Clearly, a perfect typing cannot
exist. □

Example 5. Let τ = (ab)+ be a type and T = s(f_1 f_2) be a kernel tree. There
are three maximal local typings:

s_1((ab)*), s_2((ab)+)      s_1((ab)*a), s_2(b(ab)*)      s_1((ab)+), s_2((ab)*)

according to whether either s_1, s_1(a), or none of them may belong to each
possible ext(f_1), respectively. □

The following theorem completes the comparison of the properties of typings
we study. (Note that its converse is not true by Example 4.)

Theorem 2.1. Every perfect typing is unique maximal local. 

Proof. Consider a perfect typing (τ_n) for T(f_n) and τ. We observe that (τ_n)
is local, by definition. Moreover, by definition, for each other sound typing
(τ'_n), and so in particular for each other local one, we have
(τ'_n) ≤ (τ_n). The typing (τ_n) is maximal because there exists no other
sound typing (τ''_n) such that (τ_n) < (τ''_n), and it is unique because there
is no other local typing (τ''_n) such that, for some index i and some string
w, w ∈ [τ''_i] but w ∉ [τ_i]. □

3. Bottom-up design

In this section, we consider bottom-up design.

3.1. R-EDTDs typing

Let T(f_n) be a kernel and (τ_n) be an R-EDTD-typing where each
τ_i = (Σ_i, Σ̃_i, π_i, s̃_i, μ_i). We next present the construction of T(τ_n),
which (to be as general as possible) is an nFA-EDTD. We use the following
notations:

1. Σ_0 contains the element names in T (the labels but not the functions);
2. Σ̃_0 contains a specialized element name ã^x, for each a ∈ Σ_0 and each
   node x of T with label a;
3. s_0 is the root of T;
4. s_i is the root of trees in [τ_i] for each i.

We also make without loss of generality the following assumptions:

1. Σ̃_i ∩ Σ̃_j = ∅, for each i, j, i ≠ j. (Note that Σ_i ∩ Σ_{j≠i} may be
   nonempty.)

Consider the nFA-EDTD T(τ_n) = (Σ, Σ̃, π, s̃_0, μ) defined as follows:

1. Σ = Σ_0 ∪ (Σ_1 − {s_1}) ∪ ... ∪ (Σ_n − {s_n});
2. Σ̃ = Σ̃_0 ∪ (Σ̃_1 − {s̃_1}) ∪ ... ∪ (Σ̃_n − {s̃_n});
3. s̃_0 = s̃_0^x, where x is the root of T;
4. μ(ã) = a for each ã ∈ Σ̃;
5. π(ã^x) = nFA({ε}) for each leaf-node x of T with label a ∈ Σ_0;
6. π(ã_i) = nFA([π_i(ã_i)]) for each ã_i ∈ Σ̃ with i in [1..n];
7. for each node x of T with label a and children y_1 ... y_p, we define
   π(ã^x) = nFA(L_1 ... L_p) where each language L_k is
   • {b̃^{y_k}} if y_k has label b ∈ Σ;
   • [π_i(s̃_i)] if y_k is labeled by f_i.

The previous algorithm clearly runs in polynomial time by scanning the tree
T and performing some easy regular language manipulation. Also, the size of
T(τ_n) is linear in the size of the input pair T and (τ_n). This is clearly
true for R ∈ {nFA, dFA}, where only a linear number of ε-transitions is
required. If R ∈ {nRE, dRE}, it is also true, not because the translation from
regular expressions to nFAs produces at most an n log^2 n blow-up, but because
in these cases we might define T(τ_n) directly as an nRE-EDTD-type of actual
linear size. These considerations immediately yield the following proposition.

Proposition 3.1. Given T(f_n) and (τ_n), the nFA-EDTD-type T(τ_n) can be
constructed in polynomial time, and its size is linear in the input pair.
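The construction of T(τ_n) can be sketched as a single preorder scan of the kernel. The following is a simplified illustration under several assumptions: content models are written as plain strings over specialized names (the empty string standing for {ε}) instead of nFAs, witnesses of kernel nodes are named `label^position`, and the function name `build_T` and the sample typing are hypothetical.

```python
def build_T(kernel, typing):
    """Sketch of T(tau_n): one fresh witness per element node of the kernel,
    function-nodes replaced by the root content model of their type."""
    pi, mu = {}, {}
    counter = [0]  # preorder position of the node being visited

    def visit(node):
        label, children = node
        counter[0] += 1
        witness = f"{label}^{counter[0]}"      # fresh specialized name
        mu[witness] = label
        parts = []
        for child in children:
            if child[0] in typing:
                counter[0] += 1                # function-nodes are leaves
                tau = typing[child[0]]
                parts.append("(" + tau["pi"][tau["root"]] + ")")
            else:
                parts.append(visit(child))
        pi[witness] = " ".join(parts)          # "" stands for {epsilon}
        return witness

    root = visit(kernel)
    # Import every content model of each type, except the one of its root.
    for tau in typing.values():
        for name, model in tau["pi"].items():
            if name != tau["root"]:
                pi[name] = model
                mu[name] = tau["mu"][name]
    return {"root": root, "pi": pi, "mu": mu}

# Hypothetical kernel s0(f1 a(b f2) c) with two small types.
T = ("s0", [("f1", []), ("a", [("b", []), ("f2", [])]), ("c", [])])
typing = {
    "f1": {"root": "s1~",
           "pi": {"s1~": "b1 d1+ a1*", "a1": "b1+", "b1": "", "d1": ""},
           "mu": {"a1": "a", "b1": "b", "d1": "d"}},
    "f2": {"root": "s2~",
           "pi": {"s2~": "b2*", "b2": ""},
           "mu": {"b2": "b"}},
}
result = build_T(T, typing)
print(result["pi"]["s0^1"])   # (b1 d1+ a1*) a^3 c^6
print(result["pi"]["a^3"])    # b^4 (b2*)
```

Each kernel node is visited once and each content model of the typing is copied once, which matches the linear size bound stated in the proposition for this string-based representation.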

Now we prove that our construction preserves the semantics of ext_T(τ_n).

Theorem 3.2. Given a kernel T(f_n) and an R-EDTD-typing (τ_n),
[T(τ_n)] = ext_T(τ_n) holds for each possible R.

Proof. By construction of T(τ_n), we assume the specialized element names in
each type τ_i of (τ_n) to be different (in fact, they could always be renamed
appropriately before building T(τ_n)). Also, the specialized element names
added for giving witnesses to the nodes of T labeled with an element name
belong to a fresh set (it is Σ̃_0). This means that there is no "competition"
among all of these witnesses. So we just create new content models that allow
all and only the trees being valid for each τ_i and the non-function nodes
that are already in T. But this is exactly the semantic definition of
ext_T(τ_n). □

Corollary 3.3. All the problems CONS[nFA-EDTD], CONS[dFA-EDTD],
CONS[nRE-EDTD], and CONS[dRE-EDTD] always have a yes answer. Thus, they are
decidable in constant time.

Proof. For CONS[nFA-EDTD], CONS[dFA-EDTD], and CONS[nRE-EDTD] the answer is
always "yes" because each content model in T(τ_n) is, respectively, already
an nFA, expressible by a dFA, and expressible by an nRE.

For CONS[dRE-EDTD] the answer is always "yes" as well, but the reason is less
obvious. In general, there are regular languages not expressible by dREs.
Anyway, in our case, by considering how π is built in T(τ_n), we are sure that
each content model has an equivalent dRE. In fact, π(ã^x) = ε (step 5) is
already a dRE; π(ã_i) = nFA([π_i(ã_i)]) (step 6) has an equivalent dRE because
π_i(ã_i) is already a dRE by definition; π(ã^x) = nFA(L_1 ... L_p) (step 7) is
expressible by a dRE because each L_k originates itself from a dRE and does
not share any symbol with any L_{j≠k}. □

By Corollary 3.3, we now give a safe and easy construction of type_T(τ_n) from
T(τ_n) according to the schema language S used for (τ_n).

• For nFA-EDTDs, we choose type_T(τ_n) = T(τ_n);

• For dFA-EDTDs, we modify T(τ_n) by computing the ε-closure for each content
  model. Notice that this can be done in polynomial time and the size of
  type_T(τ_n) is at most quadratic (and there are cases where this could
  really happen) in the size of T(τ_n) because each content model originates
  from dFAs that do not share any symbol.

• For nRE-EDTDs or dRE-EDTDs, we modify the content models of T(τ_n) as
  follows: each π(ã_i) = π_i(ã_i) and each π(ã^x) = R_1 ... R_p where the
  generic R_k is either b̃^{y_k} or π_i(s̃_i) (compare with the definition of
  T(τ_n)). Also here the size of type_T(τ_n) is linear in the size of T(τ_n)
  due to the previous corollary.

3.2. R-SDTDs typing

For R-SDTDs we also use T(τ_n) as defined for R-EDTDs because any R-SDTD
can be seen as a special R-EDTD and the algorithm for building T(τ_n) still
works with no problem. At this point, it should be clear that T(τ_n) may
easily fail to be an R-SDTD because of our assumptions
(Σ̃_i ∩ Σ̃_{j≠i} = ∅). But, in this case, it is also possible that T(τ_n) does
not have an equivalent R-SDTD. Indeed, T(f_n) may contain some pattern that
already prohibits obtaining an R-SDTD for any possible typing (τ_n), or it may
contain a function layout that prohibits obtaining an R-SDTD for some (τ_n).
So, we have to discriminate when this is possible or not. Such a problem
(deciding whether an R-EDTD has an equivalent R-SDTD) is in general (when R
stands for nREs or nFAs) an EXPTIME-complete problem [29]. Nevertheless, we
will show that, in our case, it is almost always "easier" and in particular
that, in general, it depends on the complexity of equivalence between tree
languages specified by nFA-SDTDs, which, in turn, depends on the complexity
of equivalence between string languages specified by nFAs. Before giving
proofs of that, we illustrate the definition of distributed document using
nRE-SDTD-types.

Example 6. Let T = s_0(f_1 a(b f_2) c) be a kernel tree and τ_1, τ_2 be two
nRE-SDTD-types describing respectively b · d+ · a(b+)* and b*. In the
nRE-SDTD syntax, τ_1 = ({s_1, a, b, d}, {s̃_1, ã_1, b̃_1, d̃_1}, π_1, s̃_1, μ_1)
and τ_2 = ({s_2, b}, {s̃_2, b̃_2}, π_2, s̃_2, μ_2) are two types where:

• Σ_0 = {s_0, a, b, c}; Σ̃_0 = {s̃_0^1, ã^3, b̃^4, c̃^6}, where {1, 3, 4, 6} are
  the nodes of T with label in Σ_0 (based on a preorder traversal of T);
• π_1(s̃_1) = b̃_1 · d̃_1+ · ã_1*; π_1(ã_1) = b̃_1+; π_2(s̃_2) = b̃_2*;
  π_1(b̃_1) = π_1(d̃_1) = π_2(b̃_2) = ε;
• μ_1 and μ_2 are clear.

For instance, the activation of both f_1 and f_2 may return trees
s_1(b d a(bbb)) and s_2(bb), respectively. In general, the resulting type is
s_0(b · d+ · a(b+)+ · c). It can be described by an nRE-SDTD. Thus,
(τ_1, τ_2) is an nRE-SDTD-typing consistent with T. □

Now we need to introduce some definitions and mention previous results. 

Lemma 3.4. Let τ = (Σ, Σ̃, π, s̃, μ) be an R-SDTD. For each ã ∈ Σ̃, also
τ(ã) = (Σ, Σ̃, π, ã, μ) is an R-SDTD.

Proof. By definition of R-SDTD. (In the worst case, if τ is reduced, then
τ(ã) may not be.) □

Definition 15 ([29]). A tree language L is closed under ancestor-guarded
subtree exchange if the following holds. For each t_1, t_2 ∈ L, and for each
x_1, x_2 in t_1, t_2, respectively, with
anc-str_{t_1}(x_1) = anc-str_{t_2}(x_2), the trees obtained by exchanging
tree_{t_1}(x_1) and tree_{t_2}(x_2) are still in L. □

Lemma 3.5 ([29]). A tree language is definable by an R-SDTD iff it is "closed
under ancestor-guarded subtree exchange" and each content model is defined by
an R-type.

Remark 3. Intuitively, this means that the witness associated by an
R-SDTD-type τ to a node x of a tree t ∈ [τ] only depends on the string
anc-str_t(x). This is consistent with the definition of dual(τ) as a dFA. In
fact, the (unique) sequence of states that dual(τ) scans for recognizing
anc-str_t(x) (except the initial one) gives exactly the unique witness to each
node of t in the path from the root to x. □
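On a finite tree language, closure under ancestor-guarded subtree exchange can be tested by brute force. The sketch below is only an illustration: trees are encoded as nested tuples, the ancestor string of a node is taken to be the sequence of labels from the root down to the node itself (an assumption about the exact convention), and all function names are hypothetical.

```python
from itertools import product

def positions(tree, path=(), anc=()):
    """Yield (path, ancestor string) for every node of a (label, children) tree."""
    label, children = tree
    anc = anc + (label,)
    yield path, anc
    for i, child in enumerate(children):
        yield from positions(child, path + (i,), anc)

def subtree(tree, path):
    """Return the subtree rooted at the node reached by following `path`."""
    for i in path:
        tree = tree[1][i]
    return tree

def replace(tree, path, new):
    """Return a copy of `tree` where the subtree at `path` is replaced by `new`."""
    if not path:
        return new
    label, children = tree
    i = path[0]
    return (label,
            children[:i] + (replace(children[i], path[1:], new),) + children[i + 1:])

def closed_under_exchange(language):
    """Check Definition 15 on a finite set of trees."""
    language = set(language)
    for t1, t2 in product(language, repeat=2):
        for p1, a1 in positions(t1):
            for p2, a2 in positions(t2):
                # Both orders (t1, t2) and (t2, t1) are enumerated, so
                # checking one direction of the exchange is enough.
                if a1 == a2 and replace(t1, p1, subtree(t2, p2)) not in language:
                    return False
    return True

leaf = lambda x: (x, ())
ab_ac = ("s", (("a", (leaf("b"),)), ("a", (leaf("c"),))))   # s(a(b) a(c))
ab_ab = ("s", (("a", (leaf("b"),)), ("a", (leaf("b"),))))   # s(a(b) a(b))

print(closed_under_exchange({ab_ac}))   # False
print(closed_under_exchange({ab_ab}))   # True
```

The singleton {s(a(b) a(c))} fails the test because both a-nodes share the ancestor string s a, and swapping their subtrees yields s(a(c) a(c)), which is outside the language.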

Proposition 3.6 ([11]).

1. There is an equivalent dRE for each one-unambiguous regular language;

2. Let A be a minimum dFA. There is an algorithm, running in time quadratic
   in the size of A, that decides whether [A] is one-unambiguous;

3. There are one-unambiguous regular languages where the smallest equivalent
   dRE is exponential in the size of the minimum equivalent dFA. (This is
   worst-case optimal;)

4. There are one-unambiguous regular languages where some nRE is
   exponentially more succinct than the smallest equivalent dRE. In
   particular, the language {(a + b)^m b (a + b)^n : m < n, n > 0} has such a
   property;

5. The set of all one-unambiguous regular languages is not closed under
   concatenation.

Corollary 3.7.

(1) Problem ONE-UNAMB[nRE] is in EXPTIME.

(2) For each nRE defining a one-unambiguous language, there exists an
    equivalent dRE which is, at most, doubly exponential in size. (An exact
    bound is still open.)

(3) There are pairs of dREs the concatenation of which, by a string separator,
    defines a one-unambiguous language such that the smallest equivalent dRE
    has an exponential size.

Proof. (1): Let r be an nRE. Build, in polynomial time from r, an equivalent
nFA A. Run the quadratic-time algorithm described in [11] on the minimum
dFA (at most exponentially larger) equivalent to A.

(2): By Proposition 3.6, the dRE r' that we construct from the minimum dFA
introduced in (1) has size at most exponential in the size of that dFA. Thus,
the size of r' is at most doubly exponential in the size of r.

(3): Let r_1 = (a + b)^m and r_2 = (a + b)^n be two nREs, with m < n. By
definition, it is clear that they are also both dREs linear in n. Consider the
new nRE r = r_1 b r_2. By the previous proposition, r defines a
one-unambiguous language but its smallest equivalent dRE is exponentially
larger. □

Lemma 3.8 (Q). Problem ONE-UNAMB[nRE] is PSPACE-hard.

Definition 16. CONCAT-UNIV[R] is the following decision problem. Let Σ be
an alphabet. Given two R-types τ_1 and τ_2 over Σ, is [τ_1] ∘ [τ_2] = Σ*? □

Lemma 3.9 ([25], [31], [32]). CONCAT-UNIV[R] is PSPACE-complete for each
R ∈ {nFA, nRE, dFA, dRE}.

After introducing some necessary definitions and results, we are ready to
prove the following theorem. It is fundamental for pinpointing the complexity
of CONS[R-SDTD], for giving size bounds on type_T(τ_n), and for providing the
guidelines for constructing it.

Theorem 3.10. Let T(f_n) be a kernel and (τ_n) be an R-SDTD-typing.

1. If R ∈ {nFA, nRE} (nondeterministic and closed under concatenation),
   then CONS[R-SDTD] is polynomial-time Turing reducible to EQUIV[R-SDTD] and
   type_T(τ_n) is not larger than T(τ_n);

2. If R = dFA (deterministic and closed under concatenation), then problem
   CONS[R-SDTD] is polynomial-time Turing reducible to EQUIV[nFA-SDTD] and
   type_T(τ_n) has unavoidably a single-exponential blow-up w.r.t. T(τ_n) in
   the worst case;

3. If R = dRE (not closed under concatenation), then CONS[R-SDTD] is
   polynomial-space Turing reducible to ONE-UNAMB[nRE]. There are cases where
   the size of type_T(τ_n) is, at least, exponential in the size of T(τ_n). A
   doubly exponential size is sufficient in the worst case. (The exact bound
   is still open.)

Proof. First of all we observe that, by construction, the only content models
of T(τ_n) that might not satisfy the single-type requirement are those related
to the witnesses of the non-leaf nodes of T. More formally, let x be any
non-leaf node of T, the label of which is denoted by a; the content model
π(ã^x) of its (unique) witness ã^x is the only one that may contain some
conflict. All other content models either refer to leaves (ε is single-type)
or come from some τ_i (that is already single-type).

Case 1. Proof idea: Consider (τ_n) as simply an R-EDTD-typing. Build T(τ_n)
and (from it) type_T(τ_n) = (Σ, Σ̃, π, s̃_0, μ) (both in polynomial time, as
described in Section 3.1) and try to "simplify" the latter (in a bottom-up
way, starting from the nodes of T having only leaves as children and going on
to the root) so that it satisfies the single-type requirement. If the
algorithm does not fail during its run (CONS[R-SDTD] admits a "yes" answer),
then the resulting type_T(τ_n) is an R-SDTD. During the proof we only make use
of the "ancestor-guarded subtree exchange" property, and so, by Lemma 3.5, we
can conclude that if we cannot simplify type_T(τ_n), then it does not have an
equivalent R-SDTD. Moreover, since the simplification process does not change
the structure of T(τ_n) but only merges some specialized element names, the
resulting type is at most as large as the original one. Finally, to check the
subtree exchange property we only use equivalence between R-SDTDs, and the
number of performed steps is clearly polynomial in the size of T(τ_n) which,
by Proposition 3.1, is polynomial in T and (τ_n).

More formally, for each node $x$ of $T$ having only leaves as children and, of
course, an element name as label, say $a$, and for each pair of children
$y \neq z$ of $x$, do:

1. If both $y$ and $z$ are not function nodes and have the same label, say $b$:
as $\pi(\tilde b^y) = \pi(\tilde b^z) = \varepsilon$ (by definition), we can
consider hereafter, by Lemma 3.5, $\tilde b^y$ and $\tilde b^z$ the same element.

2. If only one of the two, say $y$, has an element name as label, say $b$, while
$z$ has a function as label, say $f_i$, and $\pi_i(\tilde s_i)$ contains in its
specification an element $\tilde b_i$ (at most one, as $\tau_i$ is already an
$\mathcal{R}$-SDTD): by Lemma 3.5, if
$[\mathrm{type}_T(\tau_n, \tilde b_i)] = \{b\}$, then we can consider hereafter
$\tilde b^y$ and $\tilde b_i$ the same element; otherwise we can conclude that
$\mathrm{type}_T(\tau_n)$ does not have an equivalent $\mathcal{R}$-SDTD.

3. Finally, if both $y$ and $z$ are function nodes having labels $f_i$ and
$f_j$, respectively: for each element name in $\Sigma$, say $b$, if both
$\pi_i(\tilde s_i)$ and $\pi_j(\tilde s_j)$ contain in their specifications the
elements $\tilde b_i$ and $\tilde b_j$ (at most one for each of them, as $\tau_i$
and $\tau_j$ are already $\mathcal{R}$-SDTDs), by Lemma 3.5, if
$[\mathrm{type}_T(\tau_n, \tilde b_i)] = [\mathrm{type}_T(\tau_n, \tilde b_j)]$
(by construction, this can be done by deciding whether the two
$\mathcal{R}$-SDTDs $\tau_i(\tilde b_i)$ and $\tau_j(\tilde b_j)$ are equivalent),
then we can consider hereafter $\tilde b_i$ and $\tilde b_j$ the same element;
otherwise, if for some $\tilde b_i$ and $\tilde b_j$ this is not true, we can
conclude that $\mathrm{type}_T(\tau_n)$ does not have an equivalent
$\mathcal{R}$-SDTD.

If the corresponding condition is satisfied for each $y$ and $z$, then we can
conclude that $\pi(\tilde a^x)$ complies with the single-type requirement, that
$\mathrm{type}_T(\tau_n, \tilde a^x)$ has an equivalent $\mathcal{R}$-SDTD
(obtained by applying the previous steps), and that it can be used for checking
equivalences when we consider the parent of $x$, its children, and (some
modifications of) the three previous steps (see below).

If $\mathrm{type}_T(\tau_n, \tilde a^x)$ has an equivalent $\mathcal{R}$-SDTD for
each considered node $x$, then the next iteration considers each node $x'$ of
$T$ having as children only leaves or nodes already analyzed. We perform Step 3
exactly as above, and Step 1 or Step 2 with the following trivial changes. Let
$y$ be, now, a non-leaf node (instead of a leaf):

1'. If both $y$ and $z$ are not function nodes and have the same label, say $b$:
by Lemma 3.5, if $[\mathrm{type}_T(\tau_n, \tilde b^y)] = \{b\}$, we can consider
hereafter $\tilde b^y$ and $\tilde b^z$ the same element; otherwise we can
conclude that $\mathrm{type}_T(\tau_n)$ does not have an equivalent
$\mathcal{R}$-SDTD.

2'. If only one of the two, say $y$, has an element name as label, say $b$, while
$z$ has a function as label, say $f_i$, and $\pi_i(\tilde s_i)$ contains in its
specification an element $\tilde b_i$ (at most one, as $\tau_i$ is already an
$\mathcal{R}$-SDTD): by Lemma 3.5, if
$[\mathrm{type}_T(\tau_n, \tilde b_i)] = [\mathrm{type}_T(\tau_n, \tilde b^y)]$,
then we can consider hereafter $\tilde b_i$ and $\tilde b^y$ the same element;
otherwise we can conclude that $\mathrm{type}_T(\tau_n)$ does not have an
equivalent $\mathcal{R}$-SDTD.

Finally, if we reach the root of $T$ and, after checking equivalences on its
children, we can conclude that $\pi(\tilde s_0)$ complies with the single-type
requirement, then $\mathrm{type}_T(\tau_n, \tilde s_0) = \mathrm{type}_T(\tau_n)$
is now (after merging the prescribed specialized element names) an
$\mathcal{R}$-SDTD.

Case 2. If $\mathcal{R}$ = dFA then, when we merge some specialized element names
in the same content model, we can obtain an nFA. So we can still invoke the
EQUIV[nFA-SDTD] problem, but the size of $\mathrm{type}_T(\tau_n)$ may be
exponential, as we want it to be a dFA-SDTD, and there are cases for which this
may happen already by concatenating two dFAs [i^. Given that the blowup cannot
be larger than single-exponential, this bound is optimal.
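
As an illustration of the kind of blow-up mentioned in Case 2 (added here as a sketch, not part of the construction used in the proof), the following Python fragment determinizes a small nFA by the standard subset construction. The language chosen, all strings over {a, b} whose k-th symbol from the end is a, is the concatenation of two languages that each have a small dFA ($\Sigma^* a$ and $\Sigma^{k-1}$). The function names and the dictionary encoding of automata are assumptions made for this example.

```python
from collections import deque


def kth_from_end_nfa(k):
    """nFA over {a, b} for strings whose k-th symbol from the end is 'a'.

    States are 0..k, state 0 is initial and state k is the only final state.
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, k):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, 0, {k}


def reachable_subsets(delta, start, alphabet=("a", "b")):
    """Subset construction: return the set of reachable dFA states."""
    start_set = frozenset([start])
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = frozenset(
                q for p in current for q in delta.get((p, symbol), ())
            )
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def count_dfa_states(k):
    """Number of subset states reachable when determinizing the nFA above."""
    delta, start, _finals = kth_from_end_nfa(k)
    return len(reachable_subsets(delta, start))
```

For this family the nFA has k + 1 states, while the number of reachable subset states doubles each time k grows by one.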

Case 3. If $\mathcal{R}$ = dRE then, when we merge some specialized element names
in the same content model, we can obtain (due to the concatenation and by
Proposition 3.6) an nRE that may not be expressible by a dRE. We can still
invoke the EQUIV[nRE-SDTD] problem (as a necessary condition), but we also have
to invoke the ONE-UNAMB[nRE] problem (at least as hard as the first one). Notice
that this new check does not compromise the soundness of the algorithm. In fact,
for each possible dRE-SDTD (if any) equivalent to $\mathrm{type}_T(\tau_n)$, the
unique witness that can be assigned to $x$, due to Lemma 3.5, must define the
same language as $\pi(\tilde a^x)$ by applying $\mu$ to them. Finally, if both
decision problems answer yes, then we can consider a new iteration of the
previous algorithm. In case each content model has an equivalent dRE
specification and we reach the root of $T$, we can conclude that
$\mathrm{type}_T(\tau_n)$ is now a dRE-SDTD. By Proposition 3.6, there are cases
where $\mathrm{type}_T(\tau_n)$ may require at least single-exponential size. By
Corollary 3.7, a doubly exponential size is sufficient in the worst case. □

We now have the following result: 

Corollary 3.11. 

(1) Problems CONS[nRE-SDTD] and CONS[nFA-SDTD] are PSPACE-complete;

(2) Problem CONS[dFA-SDTD] is PSPACE-complete;

(3) Problem CONS[dRE-SDTD] is both PSPACE-hard and in EXPTIME.

Proof. Membership. For (1) and (2), consider that both EQUIV[nRE-SDTD] and
EQUIV[nFA-SDTD] are feasible in PSPACE [29]. For (3), we also consider that
ONE-UNAMB[nRE] is doable in EXPTIME, by Corollary 3.7.

Hardness. For (1) we know that both EQUIV[nRE-SDTD] and EQUIV[nFA-SDTD] are
also PSPACE-hard \2^ . 

For (2) and (3) we directly consider a reduction from CONCAT-UNIV[$\mathcal{R}$]
(PSPACE-hard, by Lemma 3.9) to problem CONS[$\mathcal{R}$-SDTD]
($\mathcal{R} \in \{$dFA, dRE$\}$). In particular, let $A_1$, $A_2$ be two
$\mathcal{R}$-types; we consider the consistency problem for the kernel tree
$T = s(a(f_1 f_2)\, a(f_3))$ and the $\mathcal{R}$-SDTD-typing
$(\tau_1, \tau_2, \tau_3)$ where the trees in $\tau_1$, $\tau_2$ have only one
level other than the root, $\pi_1(\tilde s_1) = A_1$, $\pi_2(\tilde s_2) = A_2$,
and $[\pi_3(\tilde s_3)] = \Sigma^*$. It is easy to see that
$(\tau_1, \tau_2, \tau_3)$ is consistent with $T$ if and only if
$[A_1] \circ [A_2] = \Sigma^*$. □
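
The condition $[A_1] \circ [A_2] = \Sigma^*$ at the heart of this reduction can be made concrete with a small Python sketch. It assumes that both types are given as complete dFAs encoded as triples `(delta, start, finals)`; this encoding and the function name are choices made for the illustration. The search tracks the state of $A_1$ together with the set of states in which $A_2$ may currently be, so it may explore exponentially many pairs in the worst case.

```python
from collections import deque


def concat_is_universal(dfa1, dfa2, alphabet):
    """Decide whether [A1] o [A2] equals alphabet* for two complete dFAs.

    Each dFA is a triple (delta, start, finals), where delta maps
    (state, symbol) to the next state.
    """
    delta1, start1, finals1 = dfa1
    delta2, start2, finals2 = dfa2

    def with_restart(q1, active):
        # Whenever A1 is in a final state, A2 may start reading from here.
        return active | {start2} if q1 in finals1 else active

    start = (start1, frozenset(with_restart(start1, set())))
    seen = {start}
    queue = deque([start])
    while queue:
        q1, active = queue.popleft()
        if not (active & finals2):
            return False  # some word reaching this pair is not in [A1] o [A2]
        for symbol in alphabet:
            next_q1 = delta1[(q1, symbol)]
            moved = {delta2[(q2, symbol)] for q2 in active}
            state = (next_q1, frozenset(with_restart(next_q1, moved)))
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return True


# Example automata over {a, b}; state 1 is a rejecting sink where present.
SIGMA_STAR = ({(0, "a"): 0, (0, "b"): 0}, 0, {0})
A_STAR = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 1, (1, "b"): 1}, 0, {0})
B_STAR = ({(0, "b"): 0, (0, "a"): 1, (1, "a"): 1, (1, "b"): 1}, 0, {0})
```

With these automata, the concatenation of `A_STAR` and `B_STAR` misses the word `ba`, whereas concatenating `SIGMA_STAR` with either of them covers every word.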

We conclude this section with a remark. 

Remark 4. The exponential blow-up affecting $\mathrm{type}_T(\tau_n)$ may suggest that
there are cases for which it may be better to store an XML document in a 
distributed manner keeping each part valid w.r.t. its local (and small) type 
rather than validate the whole document w.r.t. a very large type. □ 



3.3. $\mathcal{R}$-DTDs typing

Even for $\mathcal{R}$-DTDs we use $T(\tau_n)$ as defined for
$\mathcal{R}$-EDTDs. But here the algorithm we introduced for
$\mathcal{R}$-EDTDs does not work any more because an $\mathcal{R}$-DTD-typing
is structurally different from an $\mathcal{R}$-SDTD or an $\mathcal{R}$-EDTD.

Let $T$ be a kernel and $(\tau_n)$ be an $\mathcal{R}$-DTD-typing. Before
building $T(\tau_n)$ we construct, from $(\tau_n)$, an equivalent
$\mathcal{R}$-SDTD-typing $(\tau'_n)$ as follows. Let
$\tau_i = (\Sigma_i, \pi_i, s_i)$ be the $i$-th type in $(\tau_n)$. Consider the
$\mathcal{R}$-SDTD-type
$\tau'_i = (\Sigma_i, \tilde\Sigma_i, \pi'_i, \tilde s_i, \mu_i)$ defined as
follows:

. $\tilde a \in \tilde\Sigma_i$ iff $a \in \Sigma_i$;

. $\mu_i$ is a bijection between $\tilde\Sigma_i$ and $\Sigma_i$;

. $\pi'_i(\tilde a) = \mu_i^{-1}(\pi_i(a))$.

The two types are trivially equivalent. So we can build the new nFA-EDTD-type
(or nRE-EDTD-type), representing $\mathrm{ext}_T(\tau_n)$, by using $(\tau'_n)$.
But since the overhead of constructing $(\tau'_n)$ is completely negligible, we
still denote it by $T(\tau_n)$ instead of $T(\tau'_n)$.
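
The conversion just described amounts to a renaming, which the following Python sketch makes explicit. It is only an illustration under stated assumptions: content models are represented as lists of tokens, specialized element names are formed by appending `~` (assumed not to clash with existing names), and any token that is not an element name is treated as a regular-expression operator and left untouched.

```python
def dtd_to_sdtd(sigma, pi, root):
    """Turn a DTD (sigma, pi, root) into an equivalent SDTD.

    sigma: set of element names.
    pi: maps each element name to its content model, a list of tokens.
    root: the start element name.
    Returns (sigma, specialized names, new content models, new root, mu).
    """
    specialize = {a: a + "~" for a in sigma}  # plays the role of mu^{-1}
    mu = {tilde_a: a for a, tilde_a in specialize.items()}
    pi_prime = {
        specialize[a]: [specialize.get(tok, tok) for tok in pi[a]]
        for a in sigma
    }
    return sigma, set(mu), pi_prime, specialize[root], mu
```

Because the renaming is a bijection, every content model keeps its shape, which is why the overhead of the construction is negligible.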

Also in this case we would like to decide whether $T(\tau_n)$ has an equivalent
$\mathcal{R}$-DTD-type or not, although the general problem (when $\mathcal{R}$
stands for nREs or nFAs) of deciding whether an $\mathcal{R}$-EDTD has an
equivalent $\mathcal{R}$-DTD is EXPTIME-complete [i^. As for
$\mathcal{R}$-SDTDs, we will show that in our setting we can do better.



Definition 17 ([35]). A tree language $L$ is closed under subtree substitution
if the following holds. Whenever for two trees $t_1, t_2 \in L$ with nodes $x_1$
and $x_2$, respectively, $\mathrm{lab}^{t_1}(x_1) = \mathrm{lab}^{t_2}(x_2)$,
then the trees obtained, from $t_1$ and $t_2$, by exchanging
$\mathrm{tree}^{t_1}(x_1)$ and $\mathrm{tree}^{t_2}(x_2)$ are still in $L$. □
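
The exchange operation in Definition 17 can be sketched in Python as follows. The representation is an assumption made for the example: a tree is a pair `(label, children)` with `children` a tuple of trees, and a node is addressed by the tuple of child indices leading to it from the root.

```python
def subtree(tree, path):
    """Return the subtree reached by following the child indices in path."""
    for index in path:
        tree = tree[1][index]
    return tree


def replace_subtree(tree, path, new_subtree):
    """Return a copy of tree whose subtree at path is new_subtree."""
    if not path:
        return new_subtree
    label, children = tree
    head, rest = path[0], path[1:]
    updated = replace_subtree(children[head], rest, new_subtree)
    return (label, children[:head] + (updated,) + children[head + 1:])


def exchange(t1, path1, t2, path2):
    """Swap the subtrees rooted at two nodes that carry the same label."""
    s1, s2 = subtree(t1, path1), subtree(t2, path2)
    if s1[0] != s2[0]:
        raise ValueError("the two nodes must carry the same label")
    return replace_subtree(t1, path1, s2), replace_subtree(t2, path2, s1)
```

A language is closed under subtree substitution when both trees returned by `exchange` stay in the language for every choice of trees and equally labeled nodes.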



Lemma 3.12 ([35]). A tree language is definable by an $\mathcal{R}$-DTD iff it is
"closed under subtree substitution" and each content model is defined by an
$\mathcal{R}$-type.

The following theorem (with the related corollary) concludes the set of results 
for the bottom-up design problem, and gives the last guidelines for constructing 
$\mathrm{type}_T(\tau_n)$ or evaluating its size.

Theorem 3.13. Let $T(f_n)$ be a kernel and $(\tau_n)$ be an
$\mathcal{R}$-DTD-typing.

1. If $\mathcal{R} \in \{$nFA, nRE$\}$, then CONS[$\mathcal{R}$-DTD] is
polynomial-time Turing reducible to EQUIV[$\mathcal{R}$-SDTD] and
$\mathrm{type}_T(\tau_n)$ is linear in $T(\tau_n)$;

2. If $\mathcal{R}$ = dFA, then CONS[$\mathcal{R}$-DTD] is polynomial-time Turing
reducible to EQUIV[nFA-SDTD] and $\mathrm{type}_T(\tau_n)$ has unavoidably a
single-exponential blow-up w.r.t. $T(\tau_n)$ in the worst case;

3. If $\mathcal{R}$ = dRE, then CONS[$\mathcal{R}$-DTD] is polynomial-space
Turing reducible to ONE-UNAMB[nRE] and there are cases where
$\mathrm{type}_T(\tau_n)$ is at least exponentially larger than $T(\tau_n)$. A
doubly exponential size is sufficient in the worst case. (The exact bound is
still open.)

Proof. Build the $\mathcal{R}$-SDTD-typing $(\tau'_n)$ from $(\tau_n)$ as said
before. Perform, from $T$ and $(\tau'_n)$, the decision algorithm defined in the
proof of Theorem 3.10, enforcing, due to Lemma 3.12, the following additional
constraint at the end of each macro-step, when we assert that
$\mathrm{type}_T(\tau_n, \tilde a^x)$ has an equivalent $\mathcal{R}$-SDTD:

. $[\mu(\pi(\tilde a'))] = [\mu(\pi(\tilde a''))]$ for each
$\tilde a', \tilde a''$ already considered.

Finally, notice that the (polynomially many) additional steps are special cases
of calls to EQUIV[$\mathcal{R}$-SDTD], and that the same observations made for
$\mathcal{R}$-SDTDs hold for $\mathrm{type}_T(\tau_n)$ as well. □

Corollary 3.14. We have the following results: 

(1) Problems CONS[nRE-DTD] and CONS[nFA-DTD] are PSPACE-complete;

(2) Problem CONS[dFA-DTD] is PSPACE-complete;

(3) Problem CONS[dRE-DTD] is both PSPACE-hard and in EXPTIME.

Proof. Membership. As for $\mathcal{R}$-SDTDs (see Corollary 3.11).

Hardness. For (1) we know that both EQUIV[nRE-DTD] and EQUIV[nFA-DTD] are also
PSPACE-hard [29, 32].

For (2) and (3) we use the same reduction (from CONCAT-UNIV[$\mathcal{R}$]) that
we used in Corollary 3.11, where the problem CONS[$\mathcal{R}$-SDTD] is now
replaced by CONS[$\mathcal{R}$-DTD]. In particular, we just notice that also in
this case $(\tau_1, \tau_2, \tau_3)$ is consistent with
$T = s(a(f_1 f_2)\, a(f_3))$ if and only if
$[\pi_1(s_1)] \circ [\pi_2(s_2)] = [\pi_3(s_3)] = \Sigma^*$. □



4. Top-down design 

In this section, we consider design problems where we start from a kernel 
and a given global type, and we show how to reduce each of these problems on 
trees to a set of typing problems on strings. In the next section, we will show 
how to solve the problems for strings.

4.1. $\mathcal{R}$-DTDs

We briefly present a basic result on the equivalence of $\mathcal{R}$-DTDs. Its
proof is straightforward and thus omitted.

Proposition 4.1. Two reduced $\mathcal{R}$-DTDs $\tau_1$ and $\tau_2$ are
equivalent if and only if the following are true:

1. They have the same root; 

2. They use the same element names; 

3. For each element name a, the content models of a in both are equivalent. 
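
The three conditions can be checked mechanically once an equivalence test for content models is available. The Python sketch below assumes, for the sake of illustration, that both DTDs are already reduced, that a DTD is a triple `(names, pi, root)`, and that each content model is a dFA `(delta, start, finals)` whose missing transitions lead to an implicit rejecting sink; none of these encodings comes from the text.

```python
def dfa_equivalent(dfa1, dfa2, alphabet):
    """Check language equivalence of two dFAs by exploring state pairs.

    A missing transition is read as a move to a rejecting sink, encoded
    here by the value None (so None must not be used as a real state).
    """
    delta1, start1, finals1 = dfa1
    delta2, start2, finals2 = dfa2
    seen = {(start1, start2)}
    stack = [(start1, start2)]
    while stack:
        p, q = stack.pop()
        if (p in finals1) != (q in finals2):
            return False
        for symbol in alphabet:
            pair = (delta1.get((p, symbol)), delta2.get((q, symbol)))
            if pair not in seen:
                seen.add(pair)
                stack.append(pair)
    return True


def dtds_equivalent(dtd1, dtd2):
    """Apply the three conditions of the proposition to two reduced DTDs."""
    names1, pi1, root1 = dtd1
    names2, pi2, root2 = dtd2
    if root1 != root2 or names1 != names2:
        return False
    return all(dfa_equivalent(pi1[a], pi2[a], names1) for a in names1)
```

The check is purely local: no tree is ever built, only the content models of equally named elements are compared.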

Theorem 4.2. Let $D = (\tau, T(f_n))$ be a distributed design where
$\tau = (\Sigma, \pi, s)$ is an $\mathcal{R}$-DTD. The following are equivalent:

(1) $D$ admits a local $\mathcal{R}$-DTD-typing;

(2) The $\mathcal{R}$-design $D^x = (\pi(\mathrm{lab}(x)), \text{child-str}(x))$
admits a local $\mathcal{R}$-typing for each node $x$ in $T$ where
$\mathrm{lab}(x) \in \Sigma$.

Proof. (1) ⇒ (2): Let $(\tau_n)$ be a local typing for $D$; then
$\mathrm{type}_T(\tau_n) \equiv \tau$ holds. This means (by Proposition 4.1)
that for each node $x$ in $T$ such that $\mathrm{lab}(x) \in \Sigma$, the content
model $\pi(\mathrm{lab}(x))$ of $x$ has an equivalent specification in
$\mathrm{type}_T(\tau_n)$. But this means that the subset of types in $(\tau_n)$
in bijection with the functions of child-str($x$) represents a local typing for
$D^x$ as well.

(2) ⇒ (1): Also in this case, by Proposition 4.1, since each design $D^x$ with
$\mathrm{lab}(x) \in \Sigma$ has a local typing, such a typing allows describing
exactly the content model $\pi(\mathrm{lab}(x))$. Thus, by combining all the
local typings of the various string-designs with the content models of $\tau$,
we obtain a $D$-consistent typing that is also local for $D$. To be more
precise, we now show how to exploit the local string-typings for building a
local typing for $D$. First of all we observe that, for each $i$ in $[1..n]$,
there exists only one node $x$ of $T$ such that $f_i$ is in the kernel string
child-str($x$) of $D^x$. Since each $D^x$ admits a local typing, there is a
sequence, say $(\tau_1^{str}, \ldots, \tau_n^{str})$, of string-types (one for
each function) allowing that. In particular, if for some $x$, child-str($x$) has
no function, then this necessarily means that $D^x$ admits a trivial local
typing, namely $[\pi(\mathrm{lab}(x))] = \{\text{child-str}(x)\}$ must hold. Let
$i$ be an index in $[1..n]$, and $x$ be the parent of $f_i$. The new type (not
necessarily reduced) $\tau_i = (\Sigma_i, \pi_i, s_i)$ is defined as follows:

. $\Sigma_i = \Sigma \cup \{s_i\}$;

. $\pi_i$ contains all the rules of $\pi$ and the extra rule
$\pi_i(s_i) = \tau_i^{str}$.

Finally, it is very easy to see that $T(\tau_n)$ is structurally equivalent to
$\tau$. □

Corollary 4.3. The problems LOC[$\mathcal{R}$-DTD], ML[$\mathcal{R}$-DTD],
PERF[$\mathcal{R}$-DTD], ∃-LOC[$\mathcal{R}$-DTD], ∃-ML[$\mathcal{R}$-DTD] and
∃-PERF[$\mathcal{R}$-DTD] are logspace Turing reducible to LOC[$\mathcal{R}$],
ML[$\mathcal{R}$], PERF[$\mathcal{R}$], ∃-LOC[$\mathcal{R}$],
∃-ML[$\mathcal{R}$] and ∃-PERF[$\mathcal{R}$], respectively.

Proof. Let $D = ((\Sigma, \pi, s), T(f_n))$ be a top-down
$\mathcal{R}$-DTD-design.

Consider, first, the ∃-LOC[$\mathcal{R}$-DTD] problem. Scan $T$ in document
order, which is well known to be feasible in logarithmic space [3|. For each
node $x$ in $T$ such that $\mathrm{lab}(x) \in \Sigma$, solve the problem
∃-LOC[$\mathcal{R}$] for the design $D^x$.

If we consider, instead, the problem LOC[$\mathcal{R}$-DTD], as $(\tau_n)$ is
$D$-consistent, $\mathrm{type}_T(\tau_n)$ exists and there are no different
content models for the same element name. So it is again enough to scan $T$ in
document order and, for each node $x$ in $T$ such that
$\mathrm{lab}(x) \in \Sigma$, solve the problem LOC[$\mathcal{R}$] for the design
$D^x$ and the subset of types from $(\tau_n)$ in bijection with the functions in
$D^x$.

For the maximal and perfect requirements, as they are specializations of the
local requirement, it is enough to observe that, by Theorem 4.2, they only
depend on the structure of the various $D^x$. □
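
The decomposition used in this proof, one string-design per non-function node of the kernel, can be sketched as follows. The tree encoding (`(label, children)` pairs), the set of function labels and the string form of the content models are assumptions of the example; the recursive traversal only illustrates the document-order scan and does not attempt to respect the logarithmic space bound.

```python
def string_designs(tree, pi, functions):
    """Yield (label, content model, child-str) for each non-function node.

    Nodes are visited in document order (pre-order). For a leaf the
    child-str is the empty list, standing for the empty word.
    """
    label, children = tree
    if label in functions:
        return
    child_str = [child[0] for child in children]
    yield label, pi[label], child_str
    for child in children:
        yield from string_designs(child, pi, functions)


# Kernel s(a(f1 f2) a(f3)) with two hypothetical content models.
KERNEL = ("s", [("a", [("f1", []), ("f2", [])]), ("a", [("f3", [])])])
CONTENT_MODELS = {"s": "a a", "a": "b*"}
FUNCTIONS = {"f1", "f2", "f3"}
```

Each yielded triple is an independent string problem, so a solver for the string case can be called on them one at a time.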

4.2. $\mathcal{R}$-SDTDs

Before proving that a similar reduction still holds for $\mathcal{R}$-SDTDs, we
need a proposition and a new definition.

Proposition 4.4. Let $\tau_1 = (\Sigma_1, \tilde\Sigma_1, \pi_1, \tilde s_1, \mu_1)$
and $\tau_2 = (\Sigma_2, \tilde\Sigma_2, \pi_2, \tilde s_2, \mu_2)$ be two reduced
$\mathcal{R}$-SDTDs. If they are equivalent, then for each $i, j$ in $[1..2]$ and
each $\tilde a_i \in \tilde\Sigma_i$ there is $\tilde a_j \in \tilde\Sigma_j$ such
that $\mu_i(\pi_i(\tilde a_i)) = \mu_j(\pi_j(\tilde a_j))$.

Proof. As $\tau_1$ is reduced, for each specialized element name
$\tilde a_1 \in \tilde\Sigma_1$ there is a tree $t \in [\tau_1]$ such that its
unique witness $t'$ contains at least one node having $\tilde a_1$ as label. So,
let us fix $\tilde a_1$, $t$, and a node $x$ of $t$ such that anc-str($x$) ends
with the element name $a$. As $\tau_1$ and $\tau_2$ are equivalent, there must
also exist a unique witness $t''$ for $t$ produced by $\tau_2$. Let us denote by
$\tilde a_2$ the specialized element name associated to $x$ in $t''$. As
$[\tau_1] = [\tau_2]$, if
$\mu_1(\pi_1(\tilde a_1)) \neq \mu_2(\pi_2(\tilde a_2))$, then we would violate
the ancestor-guarded subtree exchange property. □

Definition 18. Let $D = (\tau, T(f_n))$ be a distributed design where the type
$\tau = (\Sigma, \tilde\Sigma, \pi, \tilde s, \mu)$ is an $\mathcal{R}$-SDTD. For
each node $x$ in $T$ such that $\mathrm{lab}(x) \in \Sigma$, we denote by
$D^x = (\pi(\tilde a), w_x)$ the unique string-design induced by $D$, where
$\tilde a$ is the (unique) witness assigned by $\tau$ to $x$. Moreover,
$w_x = \varepsilon$ if $x$ is a leaf; otherwise, it is the string obtained from
children($x$) by replacing each non-function node with the corresponding
(unique) witness assigned by $\tau$. □

Theorem 4.5. Let $D = (\tau, T(f_n))$ be a distributed design where the type
$\tau = (\Sigma, \tilde\Sigma, \pi, \tilde s, \mu)$ is an $\mathcal{R}$-SDTD. The
following are equivalent:

(1) $D$ admits a local $\mathcal{R}$-SDTD-typing;

(2) Each $\mathcal{R}$-design $D^x$ induced by $D$ admits a local
$\mathcal{R}$-typing.

Proof. (1) ⇒ (2): Since $D$ admits a local $\mathcal{R}$-SDTD-typing, say
$(\tau_n)$, then $(\tau_n)$ is $\mathcal{R}$-SDTD-consistent with $T$, and
$\mathrm{type}_T(\tau_n) \equiv \tau$ holds. For each node $x$ of $T$ having an
element name as label, say $a$, consider the unique witness associated by $\tau$
and $\mathrm{type}_T(\tau_n)$ to $x$, say $\tilde a_\tau$ and $\tilde a_T$,
respectively. By hypothesis, both $\pi_\tau(\tilde a_\tau)$ and
$\pi_T(\tilde a_T)$ satisfy the single-type requirement, and by Proposition 4.4,
$\mu_\tau(\pi_\tau(\tilde a_\tau)) = \mu_T(\pi_T(\tilde a_T))$. Thus, as
$\pi_T(\tilde a_T)$ and the children of $x$ have a local decomposition induced
by $(\tau_n)$, then also $\pi_\tau(\tilde a_\tau)$ and the children of $x$
(namely $D^x$) have.

(2) ⇒ (1): We show, if the premise is true, how to build $(\tau_n)$ in such a way
that $\mathrm{type}_T(\tau_n) \equiv \tau$ holds and, in particular, that from a
structural point of view $\mathrm{type}_T(\tau_n)$ is equivalent to $\tau$. We
denote by $(\tau_1^{str}, \ldots, \tau_n^{str})$ one possible sequence of
string-types simultaneously satisfying all the string-designs. In particular, if
for some node $x$ of $T$ the kernel string of $D^x$ has no function call, this
necessarily means that $D^x$ admits a trivial local typing, namely
$\mu(\pi(\tilde a)) = \text{child-str}(x)$, where $\tilde a$ is the unique
witness assigned by $\tau$ to $x$. For each function $f_i$, consider its parent
node in $T$, say $x$. The new type
$\tau_i = (\Sigma_i, \tilde\Sigma_i, \pi_i, \tilde s_i, \mu_i)$ is defined as
follows:

. $\Sigma_i = \Sigma \cup \{s_i\}$;

. $\tilde\Sigma_i = \tilde\Sigma \cup \{\tilde s_i\}$;

. $\pi_i$ contains all the rules of $\pi$ and the extra rule
$\pi_i(\tilde s_i) = \tau_i^{str}$;

. $\tilde s_i$ is the usual extra witness for the root of any tree in $[\tau_i]$;

. $\mu_i$ is defined as $\mu$ and also $\mu_i(\tilde s_i) = s_i$.

It is very easy to see that, if we build $T(\tau_n)$ without renaming the
specialized element names, we obtain exactly $\tau$, and so
$\mathrm{type}_T(\tau_n) \equiv \tau$. In particular, when we assign the
witnesses to the non-function nodes of $T$ we choose exactly those assigned by
$\tau$. The only difference may be in the specification of the content models,
because the recomposition after a decomposition may produce a different
structure (for instance a different nFA) that is, anyway, equivalent to the
original one. Notice that the "ancestor-guarded subtree exchange" property is
guaranteed because we also require that all the designs without any function
call admit local typings. For instance, if $D^x = (\pi(\tilde a), \varepsilon)$
admits a local typing, where $x$ is a leaf node of $T$ and $\tilde a$ is its
witness assigned by $\tau$, this necessarily means that
$\pi(\tilde a) = \{\varepsilon\}$, and we automatically take it into account
when we build each $\tau_i$. □

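Corollary 4.6. The problems LOC[$\mathcal{R}$-SDTD], ML[$\mathcal{R}$-SDTD],
PERF[$\mathcal{R}$-SDTD], ∃-LOC[$\mathcal{R}$-SDTD], ∃-ML[$\mathcal{R}$-SDTD] and
∃-PERF[$\mathcal{R}$-SDTD] are logspace Turing reducible to LOC[$\mathcal{R}$],
ML[$\mathcal{R}$], PERF[$\mathcal{R}$], ∃-LOC[$\mathcal{R}$],
∃-ML[$\mathcal{R}$] and ∃-PERF[$\mathcal{R}$], respectively.

Proof. Exactly the same as for $\mathcal{R}$-DTDs. □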



4.3. $\mathcal{R}$-EDTDs

Although $\mathcal{R}$-EDTDs have nice properties simplifying the
CONS[$\mathcal{R}$-EDTD] problem and the construction of
$\mathrm{type}_T(\tau_n)$ (when we start from a kernel and an
$\mathcal{R}$-EDTD-typing), things change dramatically when we consider the
problems concerning locality. The freedom of using, in the same content model,
various specialized element names for the same element name has a price.
Consider the following example.

Example 7. Let $D = (\tau, T)$ be a dRE-EDTD-design where $T = s_0(f_1 f_2)$ and
$\tau =$

(I]^l:,7r,So,M)- In particular, 7r(Sa) = a.\b^)* + d'^ (P)* ; TT{d'^) = 5^; 7r(V) = 
7r(6^) = + g^; 7r(5^) = + h^. It is not hard to see that the string-design 
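$(\pi(\tilde s_0), f_1 f_2)$ admits only two maximal local typings:

$(\varepsilon,\ \tilde a^1(\tilde b^1)^* + \tilde a^2(\tilde b^2)^*)$ and
$(\tilde a^1(\tilde b^1)^* + \tilde a^2(\tilde b^2)^*,\ \varepsilon)$

But only the first one is also maximal for $D$, while the actual second one is
$(\tilde a^1(\tilde b^1)^* + \tilde a^2(\tilde b^2)^*,\ (\tilde b^3)^*)$ where
$[\tau_2(\tilde b^3)] = b(g)$. □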

The problem highlighted by the previous example originates from the fact that
$\tilde b^1$ and $\tilde b^2$ cannot be considered completely distinct, as
$\tilde a^1$ and $\tilde a^2$ can (notice that
$[\tau(\tilde a^1)] \cap [\tau(\tilde a^2)] = \emptyset$) and as we naturally do
for two different symbols of an alphabet in string languages; they are
witnesses for two sets of trees with a nonempty intersection. In fact,
$[\tau(\tilde b^1)] \cap [\tau(\tilde b^2)] = b(g)$ can be part of $\tau_2$ in
the second maximal local typing for $D$.

From this, it is unclear whether, by only analyzing content models (such as
$\pi(\tilde s_0)$ in the previous example), we can decide whether a given design
admits at least one local typing. Clearly, if we apply $\mu$ to both
$(\tilde a^1(\tilde b^1)^* + \tilde a^2(\tilde b^2)^*) \cdot (\tilde b^3)^*$ and
$\pi(\tilde s_0)$, we obtain the same string-language, namely $ab^*$;
unfortunately, this is only a necessary condition, and even if $(ab^*, b^*)$ is
a maximal local typing for $ab^*$, it is not clear how to assign the witnesses
for obtaining
$(\tilde a^1(\tilde b^1)^* + \tilde a^2(\tilde b^2)^*,\ (\tilde b^3)^*)$.

The following theorems give a further idea of the higher complexity of locality
when we consider $\mathcal{R}$-EDTD-designs.

Theorem 4.7 ([s^ [s^l). Problems EQUIV[nFA-EDTD] and EQUIV[nRE-EDTD] are
EXPTIME-complete.

Theorem 4.8. Problems ∃-LOC[$\mathcal{R}$-EDTD], ∃-ML[$\mathcal{R}$-EDTD], and
∃-PERF[$\mathcal{R}$-EDTD] are at least as hard as EQUIV[$\mathcal{R}$-EDTD].

Proof. We define a logspace transformation $\varphi$ from
EQUIV[$\mathcal{R}$-EDTD] to ∃-LOC[$\mathcal{R}$-EDTD]. Afterwards, we show that
the statement also holds for the other two problems by using exactly the same
reduction. Let $\tau'$, $\tau''$ be two arbitrary $\mathcal{R}$-EDTDs. The
application of $\varphi$ to this pair produces the design $D = (\tau, T)$, where

. $T = s_0(f_1\, c\, f_2)$

. $\pi(\tilde s_0) = \mathcal{R}(\tilde a_1 \tilde c_1 \tilde d_1 + \tilde b_1 \tilde c_1 \tilde d_2)$

. $\pi(\tilde d_1) = \mathcal{R}(\tilde s'_0)$, where $s'_0$ is the root of the
trees in $[\tau']$

. $\pi(\tilde d_2) = \mathcal{R}(\tilde s''_0)$, where $s''_0$ is the root of the
trees in $[\tau'']$

. $\pi(\tilde a_1) = \pi(\tilde b_1) = \pi(\tilde c_1) = \mathcal{R}(\varepsilon)$

. $\tilde c_1$ does not appear in any other content model of $\tau$ and $c$
appears exactly once in any tree in $[\tau]$

Informally, $[\tau] = s_0(a\,c\,d([\tau']) + b\,c\,d([\tau'']))$. First of all,
we observe that all the new content models (other than those already in $\tau'$
and $\tau''$) can be represented by $\mathcal{R}$-types, even dREs. Now, it is
easy to see that $D$ admits a local typing iff
$[\tau(\tilde d_1)] = [\tau(\tilde d_2)]$ iff $\tau' \equiv \tau''$. In that
case, $[\tau_1] = s_1(a + b)$ and $[\tau_2] = s_2(d([\tau']))$. Finally, we just
notice that if $\tau' \equiv \tau''$ holds, then $(\tau_1, \tau_2)$ is the unique
maximal local typing for $D$, which is even perfect. □

Corollary 4.9. Problems ∃-LOC[$\mathcal{R}$-EDTD], ∃-ML[$\mathcal{R}$-EDTD], and
∃-PERF[$\mathcal{R}$-EDTD] are EXPTIME-hard if $\mathcal{R} \in \{$nFA, nRE$\}$.

The equality $[\tau(\tilde d_1)] = [\tau(\tilde d_2)]$ in the previous reduction
is necessary because we do not know, a priori, whether $f_1$ imposes a
constraint on $f_2$ or not. In particular, this is an extreme case of the fact
that $[\tau(\tilde d_1)] \cap [\tau(\tilde d_2)] \neq \emptyset$.

What we really need is to be able to consider completely distinct, in the same
content model, each pair of different specialized element names of the form
$\tilde a$ and $\tilde a'$, namely
$[\tau(\tilde a)] \cap [\tau(\tilde a')] = \emptyset$. To do that, given an
$\mathcal{R}$-EDTD, we construct an equivalent nUTA [3^, transform it into a
dUTA [l^, and finally try to derive a new $\mathcal{R}$-EDTD satisfying our
requirement. If $\mathcal{R}$ = dRE, the last step may not always be possible.

Given an $\mathcal{R}$-EDTD $\tau = (\Sigma, \tilde\Sigma, \pi, \tilde s, \mu)$,
an equivalent nUTA $A = (K, \Sigma, \Delta, F)$ can be constructed as follows:
$K = \tilde\Sigma$; $\Delta(\tilde a, a) = \mathrm{nFA}(\pi(\tilde a))$, for each
$\tilde a \in \tilde\Sigma$; $F = \{\tilde s\}$. Now we want to transform $A$
into an equivalent dUTA $A^d$ (that may be exponential in size). Notice that
$A^d$ will have only one final state as well. Finally, we convert $A^d$ back
(whenever possible) into an $\mathcal{R}$-EDTD $\tau^d$ as follows:
$\tilde\Sigma = K$; $\pi(\tilde a) = \mathcal{R}(\Delta(\tilde a, a))$, for each
$\tilde a \in K$.

Lemma 4.10. Let $\tau^d$ be an $\mathcal{R}$-EDTD built as above. For each
element name, say $a$, and each pair $\tilde a, \tilde a'$ of different
specialized element names in $\tilde\Sigma^d(a)$,
$[\tau^d(\tilde a)] \cap [\tau^d(\tilde a')] = \emptyset$.

Proof. It is easy to see that, in a (bottom-up) run of $A^d$ over each tree
$t \in [\tau^d(\tilde a)] \cup [\tau^d(\tilde a')]$, there is only one possible
state (between $\tilde a$ and $\tilde a'$) that can be associated to the root of
$t$, and the states of $A^d$ coincide with the specialized element names of
$\tau^d$. □

Now we are ready to handle $\mathcal{R}$-EDTDs satisfying the above property
(which we call normalized). But first we introduce a general property of
$\mathcal{R}$-EDTDs.

Proposition 4.11. [2^ Let $\tau$ be an $\mathcal{R}$-EDTD. Whenever for two trees
$t_1, t_2 \in [\tau]$ with nodes $x_1$ and $x_2$, respectively, there are
witnesses $t'_1$ and $t'_2$ assigning the same specialized element name to both
$x_1$ and $x_2$, then the trees obtained, from $t_1$ and $t_2$, by exchanging
$\mathrm{tree}^{t_1}(x_1)$ and $\mathrm{tree}^{t_2}(x_2)$ are still in $[\tau]$.

The following lemma holds for general $\mathcal{R}$-EDTDs, but it is also useful
for normalized $\mathcal{R}$-EDTDs. Consider the design $D = (\tau^d, T)$ where
$T = s_0(a(f_1)\, f_2)$ and $\tau^d$ is a normalized nRE-EDTD having
$\pi(\tilde s_0) = (\tilde a^1 + \tilde a^2)^+$ (we ignore the other content
models). As $[\tau^d(\tilde a^1)] \cap [\tau^d(\tilde a^2)] = \emptyset$, it is
clear that the unique maximal local typing $(\tau_1, \tau_2)$ for $D$ has
$\pi_1(\tilde s_1) = \pi(\tilde a^1) + \pi(\tilde a^2)$ and
$\pi_2(\tilde s_2) = (\tilde a^1 + \tilde a^2)^*$. Thus the node under the root
labeled by $a$ may have either $\tilde a^1$ or $\tilde a^2$ as witness,
depending on the tree replacing $f_1$.

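Lemma 4.12. Let $D = (\tau, T)$ be an $\mathcal{R}$-EDTD-design and $(\tau_n)$ be
a local typing for $D$. For each node $x$ of $T$ having an element name as
label, say $a$, there is a set of specialized element names
$\tilde\Sigma^x \subseteq \tilde\Sigma(a)$ such that
$\bigcup_{\tilde a \in \tilde\Sigma^x} [\tau(\tilde a)] = [T(\tau_n, \tilde a^x)]$.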






Proof. By hypothesis, $[\tau] = [T(\tau_n)]$. We recall that this equivalence is
obtained when we consider all the possible extensions of $T$. Let $x$ be a node
of $T$ having an element name as label, say $a$, and $k$ be the cardinality of
$\tilde\Sigma(a)$. First of all, we observe that if
$\tilde\Sigma^x = \tilde\Sigma(a)$, then

$[T(\tau_n, \tilde a^x)] \subseteq [\tau(\tilde a^1)] \cup \ldots \cup [\tau(\tilde a^k)]$

is trivially true. In fact, in such a case, the left-hand side necessarily has
to be a subset of the right-hand side because, otherwise, in some extension of
$T$ there would be a subtree rooted in $x$ for which $\tau$ cannot produce any
witness.

Starting from $\tilde\Sigma^x = \tilde\Sigma(a)$, we claim that each
$\tilde a^j$, with $1 \le j \le k$, is either a "friend" to keep in
$\tilde\Sigma^x$ or an "intruder" to remove from $\tilde\Sigma^x$. Finally, the
resulting $\tilde\Sigma^x$ will prove the statement. Consider now each
$\tilde a^j$. We distinguish two cases:

1. We say that $\tilde a^j$ is a friend if and only if
$[\tau(\tilde a^j)] \subseteq [T(\tau_n, \tilde a^x)]$, because it then clearly
contributes to proving the statement. We leave it in $\tilde\Sigma^x$.

2. We say that $\tilde a^j$ is an intruder (and we remove it from
$\tilde\Sigma^x$) if and only if one of the following is true:

. $[\tau(\tilde a^j)] \cap [T(\tau_n, \tilde a^x)] = \emptyset$, because even if
we remove it from $\tilde\Sigma^x$,
$[T(\tau_n, \tilde a^x)] \subseteq \bigcup_{\tilde a \in \tilde\Sigma^x} [\tau(\tilde a)]$
is still true.

. $[\tau(\tilde a^j)] \cap [T(\tau_n, \tilde a^x)] \neq \emptyset$,
$[\tau(\tilde a^j)] \not\subseteq [T(\tau_n, \tilde a^x)]$, and for each
(sub)tree $st_1$ in the intersection there is no tree $t_1 \in [T(\tau_n)]$
having $st_1$ as subtree rooted in $x$ such that at least one witness $t'_1$
from $\tau$ associates $\tilde a^j$ to $x$. It is still an intruder because, for
each (sub)tree $st_1$ in the intersection and for each possible tree
$t_1 \in [T(\tau_n)]$ having $st_1$ as subtree rooted in $x$, since $t_1$ must
necessarily have a witness $t'_1$ from $\tau$, $t'_1$ associates some
$\tilde a^i$ (with $i \neq j$) to $x$, entailing that each $st_1$ is also
contained in $[\tau(\tilde a^i)]$. Finally, even if we remove $\tilde a^j$ from
$\tilde\Sigma^x$,
$[T(\tau_n, \tilde a^x)] \subseteq \bigcup_{\tilde a \in \tilde\Sigma^x} [\tau(\tilde a)]$
is still true.

. $[\tau(\tilde a^j)] \cap [T(\tau_n, \tilde a^x)] \neq \emptyset$,
$[\tau(\tilde a^j)] \not\subseteq [T(\tau_n, \tilde a^x)]$, and for some
(sub)tree $st_1$ in the intersection there is a tree $t_1 \in [T(\tau_n)]$
having $st_1$ as subtree rooted in $x$ and there is a witness $t'_1$ from $\tau$
associating $\tilde a^j$ to $x$. However, this last case is not possible because
it would contradict the hypothesis that $(\tau_n)$ is local. In fact, consider a
tree $t_2 \in [\tau]$ where some node $y$ also has witness $\tilde a^j$ and the
subtree rooted in $y$ is
$st_2 \in [\tau(\tilde a^j)] - [T(\tau_n, \tilde a^x)]$. For each node $z$ in
$t_1$ such that anc-str$^{t_1}(z)$ = anc-str$^{t_1}(x)$ and the witness for $z$
in $t'_1$ is $\tilde a^j$, by Proposition 4.11 we may replace (in $t_1$) each
subtree rooted in $z$ with $st_2$. The new tree is still in $[\tau]$ but cannot
be obtained by any possible extension of $T$, as $[T(\tau_n, \tilde a^x)]$
contains all the possible trees obtainable by all the extensions of the
functions under $x$ and $st_2$ is not among them.

Thus, if we consider all the feasible cases, the claim is true, as well as the
lemma: $\tilde\Sigma^x$ is exactly the set of all friends (or possibly a subset
of it, if some $[\tau(\tilde a^j)]$ can be obtained as the union of some
$[\tau(\tilde a')]$ with $\tilde a'$ still in $\tilde\Sigma^x$). □

Definition 19. Let $D = (\tau, T(f_n))$ be an $\mathcal{R}$-EDTD-design where the
type $\tau = (\Sigma, \tilde\Sigma, \pi, \tilde s, \mu)$ is normalized. We denote
by

. $\kappa$ any function associating to each node $x$ of $T$ either a set
$\tilde\Sigma^x \subseteq \tilde\Sigma(a)$, if $a$ is the label of $x$, or the
set $\{f\}$, if $f$ is the label of $x$.

. $D^x_\kappa = (\pi(\kappa(x)), B_x)$, for each node $x$ in $T$ with
$\mathrm{lab}(x) \in \Sigma$, the box-design induced by $D$ and $\kappa$, where
either $B_x = \{\varepsilon\}$ if $x$ is a leaf node, or
$B_x = \kappa(y_1) \ldots \kappa(y_k)$ if children($x$) $= y_1 \ldots y_k$.

Given a sound typing $(\tau_n)$ for $D$, we say that

. $\kappa$ is induced by the pair $(\tau_n)$ and $T$ if, for each non-function
node $x$ of $T$, $\kappa(x)$ contains exactly all the specialized element names
associated to $x$ by validating each possible tree in $\mathrm{ext}_T(\tau_n)$.

. $\kappa' \le \kappa$ iff $\kappa'(x) \subseteq \kappa(x)$, for each $x$. □

The intention is to relate locality properties of $D$ to locality properties of
each $D^x_\kappa$, similarly to what we did for $\mathcal{R}$-SDTDs, with the
difference that here $D^x_\kappa$ depends on the choice of $\kappa$.
Unfortunately, although $\tau$ is normalized, if $D$ admits local typings, then
$\kappa$ may not be unique. Consider the following example.

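Example 8. Let $D = (\tau, T)$ be a normalized dRE-EDTD-design where
$T = s_0(f_1\, a(f_2)\, f_3)$, $\tau = (\Sigma, \tilde\Sigma, \pi, \tilde s_0, \mu)$,
$\pi(\tilde s_0) = (\tilde a^1 \tilde a^2)^+$, $\pi(\tilde a^1) = \tilde b^1$, and
$\pi(\tilde a^2) = \tilde c^1$. We have two successful mappings $\kappa^1$,
$\kappa^2$ such that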

. K^ixi) = so, ^'(^3) = a^ = {{&^a^)+,f^a%), and D^l = (d^h) 
From them we have two different maximal local typings for $D$:

Notice that they are substantially different and also that from the other
possible mapping $\kappa^3$, where $\kappa^3(x_2) = \{\tilde a^1, \tilde a^2\}$,
we cannot derive any local typing, because if $f_2$ is replaced by $b$, then
$f_3$ must start with $a(c)$, and if $f_2$ is replaced by $c$, then $f_1$ must
start with $a(b)$. But
$((\tilde a^1 \tilde a^2)^* \tilde a^1,\ \tilde b^1 + \tilde c^1,\ \tilde a^2 (\tilde a^1 \tilde a^2)^*)$
is neither local nor (even) sound. □

Now we prove the main results of this section. 

Theorem 4.13. Let $D = (\tau, T(f_n))$ be a distributed design where $\tau$ is a
normalized $\mathcal{R}$-EDTD. The following are equivalent:

(1) $D$ admits a local typing;

(2) There is a function $\kappa$, as defined above, such that each box-design
$D^x_\kappa$ admits a local typing.

Proof. (1) ⇒ (2): Let $(\tau_n)$ be a local typing for $D$; then
$T(\tau_n) \equiv \tau$ holds. Consider the function $\kappa$ induced by
$(\tau_n)$ and $T$ (the choice is consistent with Lemma 4.12). As $\tau$ is
normalized, there is only one possibility for validating (in a bottom-up way)
each tree in $\mathrm{ext}_T(\tau_n)$. If for some node $x$ of $T$ the
box-design $D^x_\kappa$ did not admit any local typing, then there would be no
possibility of generating all the strings in $\pi(\kappa(x))$. Contradiction.

(2) ⇒ (1): If for some $\kappa$ each box-design $D^x_\kappa$ admits a local
typing, then we can construct each type $\tau_i$ as done for
$\mathcal{R}$-SDTDs, in such a way that $T(\tau_n)$ is structurally equivalent
to $\tau$. □

Corollary 4.14. Problem ∃-LOC[$\mathcal{R}$-EDTD] (or ∃-ML[$\mathcal{R}$-EDTD]
but $\mathcal{R} \neq$ dRE) for normalized $\mathcal{R}$-EDTDs is decidable by an
oracle machine in $\mathrm{NP}^{\mathcal{C}}$, where $\mathcal{C}$ is the
complexity class of solving ∃-LOC[$\mathcal{R}$] (or ∃-ML[$\mathcal{R}$]).

Proof. Let $\tau = (\Sigma, \pi, s)$ be a type and $T(f_n)$ be a kernel. Consider
the ∃-LOC[$\mathcal{R}$-EDTD] problem and the following algorithm:

1. Guess: the function $\kappa$;

2. Check: call ∃-LOC[$\mathcal{R}$] over $D^x_\kappa$ for each node $x$ of $T$
with $\mathrm{lab}(x) \in \Sigma$.

For ∃-ML[$\mathcal{R}$-EDTD] we use the same algorithm since, in general
($\mathcal{R} \neq$ dRE), a maximal local typing always exists if there is a
local one. □
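
The guess-and-check scheme can be simulated deterministically by enumerating every candidate $\kappa$, at the price of exponential time. The Python sketch below does exactly that; the oracle `box_design_is_local` is a hypothetical callback standing in for the string problem, the dictionaries describing the kernel are an assumed encoding, and restricting each $\kappa(x)$ to nonempty subsets is a simplification made for the example.

```python
from itertools import chain, combinations, product


def nonempty_subsets(items):
    """All nonempty subsets of items, as frozensets, in a fixed order."""
    items = sorted(items)
    return [
        frozenset(combo)
        for combo in chain.from_iterable(
            combinations(items, size) for size in range(1, len(items) + 1)
        )
    ]


def find_kappa(node_labels, specialized, box_design_is_local):
    """Search for a kappa whose box-designs all pass the oracle.

    node_labels: maps each non-function node of the kernel to its label.
    specialized: maps each element name to its specialized element names.
    box_design_is_local: hypothetical oracle taking (kappa, node).
    Returns the first suitable kappa, or None if there is none.
    """
    nodes = sorted(node_labels)
    choices = [nonempty_subsets(specialized[node_labels[x]]) for x in nodes]
    for guess in product(*choices):
        kappa = dict(zip(nodes, guess))
        if all(box_design_is_local(kappa, x) for x in nodes):
            return kappa
    return None
```

The nondeterministic machine of the corollary replaces the outer loop by a single guess, which is why only the oracle calls contribute to its running time.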

Problem ∃-ML[dRE-EDTD] will be discussed in Section [T]

Theorem 4.15. Let $D = (\tau, T(f_n))$ be a distributed design where $\tau$ is a
normalized $\mathcal{R}$-EDTD. The following are equivalent:

(1) $D$ admits a perfect typing;

(2) There is a function $\kappa$ such that each $D^x_\kappa$ admits a perfect
typing, and for each sound typing $(\tau'_n)$ for $D$, $\kappa' \le \kappa$ where
$\kappa'$ is induced by $(\tau'_n)$.

Proof. (1) ⇒ (2): Let $(\tau_n)$ be the perfect typing of $D$ and $\kappa$ be the
function induced by $(\tau_n)$. By Theorem 4.13, each box-design $D^x_\kappa$
admits a local typing, and clearly it is perfect as $(\tau_n)$ is. Finally, we
observe that since $(\tau'_n) \le (\tau_n)$, $(\tau'_n)$ cannot induce in
$\kappa'$ more elements than $(\tau_n)$ induces in $\kappa$.

(2) ⇒ (1): As for $\mathcal{R}$-SDTDs, the typing $(\tau_n)$ that we can
construct from the local typings of the various $D^x_\kappa$ (without renaming
the specialized element names), together with the necessary content models
already in $\tau$, produces a type $T(\tau_n)$ structurally equivalent to
$\tau$. □

Corollary 4.16. Problem ∃-PERF[$\mathcal{R}$-EDTD] for normalized
$\mathcal{R}$-EDTDs is polynomial-time reducible to ∃-PERF[$\mathcal{R}$].

Proof. Let $\tau = (\Sigma, \pi, s)$ be a type and $T(f_n)$ be a kernel.

Proof Idea: Build $\kappa$ in polynomial time and in a top-down style (this is
the technical core of the proof), assuming that a perfect typing exists. Then
call ∃-PERF[$\mathcal{R}$] over $D^x_\kappa$ for each node $x$ of $T$ with
$\mathrm{lab}(x) \in \Sigma$.

More formally, let $x$ be the root of $T$ and $m$ be the number of its children.
Consider the following steps:

1. Build from child-str($x$) a dRE, that we call $r(x)$, as follows. For each
$j$ in $[1..m]$,

. if child-str($x$)$[j]$ is an element name, say $a$, then replace it with the
set $\tilde\Sigma_j(a)$ (where the subscript means that all the specialized
element names are renamed with $j$ as subscript);

. else, if child-str($x$)$[j]$ is a function, then replace it with
$\tilde\Sigma_j^*$ ($j$ has the same meaning as above);

2. Build from $\pi(\tilde s_0)$ an $\mathcal{R}$-type $t(x)$ by replacing, in the
alphabet of $\pi(\tilde s_0)$, each symbol of the form $\tilde a$ with
$\tilde a_1, \ldots, \tilde a_m$.

3. Perform the intersection $L = [r(x)] \cap [t(x)]$.

4. For each child $y$ of $x$ having $a$ as label and position $j$, $\kappa(y)$
contains all the elements of the form $\tilde a$ such that $\tilde a_j$ is in
the alphabet of $L$.

As we now know $\kappa(y)$ for each child $y$ of $x$, we repeat the previous
steps for the children of $y$ by considering $\pi(\kappa(y))$ instead of
$\pi(\tilde s_0)$. We stop when we reach the leaves of $T$.

The algorithm is correct because, looking at the alphabet of $L$,
we see that it contains, for each $j$ in [1..m], exactly the specialized element
names that we need to associate with the $j$-th child of $x$, because they are induced by
all possible local typings. Intuitively, if the alphabet of $L$ contains, for instance,
$\tilde a_1$, this means that there is a sound typing for $D$ that induces $\tilde a$ for the first
child of $x$, and if the alphabet of $L$ does not contain, for instance, $\tilde r_3$, there is no
sound typing for $D$ inducing $\tilde r$ for the third child of $x$. □

Now we consider the remaining complexity result that does not require any 
reduction to strings. 

Theorem 4.17. Problems LOC[$\mathcal{R}$-EDTD], ML[$\mathcal{R}$-EDTD], and PERF[$\mathcal{R}$-EDTD] are at least
as hard as EQUIV[$\mathcal{R}$-EDTD].

Proof. We define a logspace transformation $\varphi$ from EQUIV[$\mathcal{R}$-EDTD] to LOC[$\mathcal{R}$-EDTD].
Afterwards, we show that the statement also holds for the other two problems.
Let $\tau', \tau''$ be two arbitrary $\mathcal{R}$-EDTDs. The application of $\varphi$ to this pair
produces the design $D = (\tau, T)$ and the typing $\tau_1$, where $[\tau] = s_0([\tau'])$, $T = s_0(f_1)$,
and $[\tau_1] = s_1([\tau''])$. Since $T(\tau_1)$ is exactly $s_0([\tau''])$, it is clear that
$\tau \equiv T(\tau_1)$ if and only if $\tau' \equiv \tau''$. Finally, we just notice that $\tau' \equiv \tau''$ iff $\tau_1$ is
both perfect and maximal local, as $T$ consists of just a function node other than
the root. □

Corollary 4.18. Problems LOC[nFA-EDTD], ML[nFA-EDTD], and PERF[nFA-EDTD] are EXPTIME-hard.

Theorem 4.19. Problem LOC[nFA-EDTD] is EXPTIME-complete.

Proof. (Membership) Let $D = (\tau, T)$ be an nFA-EDTD-design and $(\tau_n)$ be a
$D$-consistent typing. Build $T(\tau_n)$ in polynomial time (by Proposition 3.1) and
check in exponential time if $T(\tau_n) \equiv \tau$ (by Theorem 4.7).

(Hardness) By Corollary 4.18. □

5. The typing problems for words 

We study in this section the typing problems for words. (Recall that most of
our problems for trees have been reduced to problems for words.) We present a
number of complexity results. We leave for the next section two issues, namely
PERF[nFA] and ∃-PERF[nFA], for which we will need a rather complicated automata
construction. We start by recalling a definition and a result that we will use
further.

Theorem 5.1 ([s^I). EQUIV[nFA] is PSPACE-complete.

The hardness of the EQUIV[nFA] problem is used to show some hardness results
for our problems.

Theorem 5.2. Problems LOC[nFA], ML[nFA], and PERF[nFA] are PSPACE-hard.

Proof. We define a logspace transformation $\varphi$, in such a way that

EQUIV[nFA] $\le_m$ LOC[nFA].

Afterwards, we show that the statement also holds for the other two problems.
Let $A, A_1$ be two arbitrary nFAs. The application of $\varphi$ to the pair $A, A_1$
produces the design $(\tau, w)$ and the typing $\tau_1$, where $\tau = A$, $w = f_1$ and $\tau_1 = A_1$.
Since $w(\tau_1) = A_1$, it is clear that $\tau \equiv w(\tau_1)$ if and only if $A \equiv A_1$. Finally, we
just notice that $A \equiv A_1$ if and only if $\tau_1$ is both perfect and maximal local, as
$w$ consists of just a function. □

We now consider upper bounds. Section 6 will show that PERF[nFA] is in
PSPACE. We next show that LOC[nFA] is.

Theorem 5.3. LOC[nFA] is in PSPACE (so it is PSPACE-complete).

Proof. Let $w(f_n)$ be a kernel string, $\tau$ be an nFA, and $(\tau_n)$ be a typing. Since
the new automaton $w(\tau_n)$ has size $O(\|w\| + |(\tau_n)|)$, we can check in polynomial
space if $w(\tau_n) \equiv \tau$. □
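
The check in this proof can be illustrated on small inputs. The following Python sketch is not the polynomial-space procedure: it determinizes both automata on the fly and stores every visited pair of state sets, so it may use exponential space, and it only shows what is being compared. The representation of an nFA as a triple (delta, start, finals), the empty label used for ε-transitions, and the helper names from_words, substitute and equivalent are assumptions of this sketch, not notation from the text.

```python
from collections import deque

EPS = ""  # label used for epsilon-transitions (a convention of this sketch)


def from_words(words):
    """nFA (delta, start, finals) accepting exactly a finite set of words."""
    delta = {(w[:i], w[i], w[:i + 1]) for w in words for i in range(len(w))}
    return delta, "", set(words)


def substitute(kernel, typing):
    """Build an epsilon-nFA for w(tau_n).

    kernel lists terminal symbols (str) and function indices (int, 1-based);
    typing[i - 1] is the nFA plugged in for the function f_i.
    """
    delta = set()
    for pos, item in enumerate(kernel):
        cur, nxt = ("k", pos), ("k", pos + 1)
        if isinstance(item, int):
            d, s, fin = typing[item - 1]
            # Tag the states with the position so that copies stay disjoint.
            delta |= {(("f", pos, p), a, ("f", pos, q)) for (p, a, q) in d}
            delta.add((cur, EPS, ("f", pos, s)))
            delta |= {(("f", pos, f), EPS, nxt) for f in fin}
        else:
            delta.add((cur, item, nxt))
    return delta, ("k", 0), {("k", len(kernel))}


def closure(delta, states):
    """Epsilon-closure of a set of states."""
    seen, todo = set(states), list(states)
    while todo:
        p = todo.pop()
        for (x, a, q) in delta:
            if x == p and a == EPS and q not in seen:
                seen.add(q)
                todo.append(q)
    return frozenset(seen)


def equivalent(A, B):
    """Language equivalence, by searching pairs of determinized state sets."""
    (dA, sA, fA), (dB, sB, fB) = A, B
    sigma = {a for (_, a, _) in dA | dB if a != EPS}
    start = (closure(dA, {sA}), closure(dB, {sB}))
    seen, queue = {start}, deque([start])
    while queue:
        X, Y = queue.popleft()
        if bool(X & fA) != bool(Y & fB):
            return False  # some string is accepted by exactly one automaton
        for a in sigma:
            nX = closure(dA, {q for (p, b, q) in dA if p in X and b == a})
            nY = closure(dB, {q for (p, b, q) in dB if p in Y and b == a})
            if (nX, nY) not in seen:
                seen.add((nX, nY))
                queue.append((nX, nY))
    return True
```

For instance, with the kernel string a f1 c f2 e and the type abccde, the call equivalent(substitute(["a", 1, "c", 2, "e"], [from_words({"b"}), from_words({"cd"})]), from_words({"abccde"})) succeeds, whereas plugging in the languages {b, bc} and {cd, d} does not, since the string abcde would then be accepted.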

The proof that ML[nFA] is also in PSPACE requires more technical insights
and is deferred to Section 7.

Let us turn to the hardness of the ∃-versions of the problems.

Theorem 5.4. ∃-LOC[nFA], ∃-ML[nFA], and ∃-PERF[nFA] are PSPACE-hard.

Proof. We define a logspace transformation $\varphi$, in such a way that the following
relations hold:

(1) EQUIV[nFA] $\le_m$ ∃-LOC[nFA];

(2) EQUIV[nFA] $\le_m$ ∃-ML[nFA];

(3) EQUIV[nFA] $\le_m$ ∃-PERF[nFA].

Let $A_1 = (K_1, \Sigma_1, \Delta_1, s_1, F_1)$, $A_2 = (K_2, \Sigma_2, \Delta_2, s_2, F_2)$ be two nFAs. The
application of $\varphi$ to the pair $(A_1, A_2)$ produces the design $D = (A, w)$ where

. $w = f_1\, c\, f_2$, with $c$ being a fresh terminal symbol which does not belong to
$(\Sigma_1 \cup \Sigma_2)$;

. while automaton $A = (K, \Sigma, \Delta, s, F)$ is defined as follows: (i) $K = K_1 \cup K_2
\cup \{s, p_c, q_c\}$; (ii) $\Sigma = \Sigma_1 \cup \Sigma_2 \cup \{a, b, c\}$; (iii) $\Delta = \Delta_1 \cup \Delta_2 \cup \{(s, a, p_c),
(s, b, p_c), (p_c, c, q_c), (q_c, \varepsilon, s_1), (q_c, \varepsilon, s_2)\}$; (iv) $F = F_1 \cup F_2$.

Intuitively, if we consider $A_1$ and $A_2$ as nREs, then $A$ is $(acA_1 + bcA_2)$.

We claim that there is a local typing (similarly, maximal local, or perfect)
for $D$ if and only if $A_1 \equiv A_2$. First of all, we observe that transformation $\varphi$ is
extremely simple and it is clearly in logspace. In fact, string $w$ is a constant,
while the choice of a terminal symbol which appears neither in $A_1$ nor in $A_2$
can be done in logspace, and also $A$ can be obtained by merging $A_1$ and $A_2$ with
a constant number of transitions. We prove the statement for (1) and we just
notice that whenever there is a local typing for $D$, then the typing $((a + b), A_1)$
is perfect (thus, also maximal).

($\Rightarrow$) If there is a local typing for $D$ then $A_1 \equiv A_2$. Since $A \equiv (acA_1 + bcA_2)$,
$[acA_1]$ and $[bcA_2]$ form a partition of $[A]$. In this case, any local typing must have
the following form $((aX_1 + bX_2), Y)$ where $X_1, X_2, Y$ are nFAs. Clearly, all the
strings obtained from $w$ under this typing are given by $aX_1cY$ and $bX_2cY$. Then $cA_1 \equiv X_1cY$
and $cA_2 \equiv X_2cY$ must hold. But since no string in $[A_1]$ or $[A_2]$ starts
with $c$, necessarily $[X_1] = [X_2] = \{\varepsilon\}$. This way, $A_1 \equiv Y$ and $A_2 \equiv Y$ and
then $A_1 \equiv A_2$.

($\Leftarrow$) If $A_1 \equiv A_2$, there is a local typing for $D$. This part of the proof is trivial
because $((a + b), A_1)$ always represents a local typing for $D$. □

We now have lower bounds for all these problems and some upper bounds. 
We will derive missing upper bounds using the construction of automata that 
we call "perfect" for given design problems. 

6. Perfect automaton for words 

We next present the construction of the perfect automaton for a design word
problem. The perfect automaton has the property that if a perfect typing exists 
for this problem, it is "highlighted" by the automaton. This will provide a 
PSPACE procedure for finding this perfect typing if it exists. 

Let $A = (K, \Sigma, \Delta, s, F)$ be an nFA. We can assume w.l.o.g. that it has no
$\varepsilon$-transition. Given two states $q_i, q_f$ in $K$, a string $w$ in $\Sigma^*$ is said to be delimited
in $A$ by $q_i$ and $q_f$ if $(q_i, w, q_f) \in \Delta^*$. By exploiting this notion, the sets of all
the states delimiting $w$ in $A$ are defined as follows:

$Ini(A, w) = \{q_i \in K : \exists q_f \in K \text{ s.t. } (q_i, w, q_f) \in \Delta^*\}$

$Fin(A, w) = \{q_f \in K : \exists q_i \in K \text{ s.t. } (q_i, w, q_f) \in \Delta^*\}$

In particular, if $w = \varepsilon$, these two sets are $Ini(A, \varepsilon) = Fin(A, \varepsilon) = K$. $Ini(A, w)$
is called the set of initial states while $Fin(A, w)$ is the set of final states for
the word $w$. Given two states $q_i, q_f$ in $K$, the local automaton $A(q_i, q_f) =
(K' \subseteq K, \Sigma, \Delta', q_i, \{q_f\})$ induced from $A$ by $q_i, q_f$ is a portion of $A$ containing
all those transitions of $A$ leading from $q_i$ to $q_f$. More precisely, for each pair
of states $q, q'$ in $K$ and for each symbol $a$ in $\Sigma$, $(q, a, q') \in \Delta'$ if and only if
there are two strings $u, v$ in $\Sigma^*$ such that: $(q_i, u, q) \in \Delta^*$, $(q, a, q') \in \Delta$, and
$(q', v, q_f) \in \Delta^*$. Finally, given two strings $w_1, w_2$ in $\Sigma^+$, $\mathcal{A}(w_1, w_2)$ is
the set of all local automata induced by $w_1$ and $w_2$. It is formally defined as
$\mathcal{A}(w_1, w_2) = \{A(q_i, q_f) : q_i \in Fin(A, w_1),\ q_f \in Ini(A, w_2)\}$. In particular,
$w_i = \varepsilon$ holds for some $i$ when the kernel string contains consecutive functions.
For the previous definitions we then have:

$\mathcal{A}(w_1, \varepsilon) = \{A(q_i, q_f) : q_i \in Fin(A, w_1) \text{ and } q_f \in K\}$

$\mathcal{A}(\varepsilon, w_2) = \{A(q_i, q_f) : q_i \in K \text{ and } q_f \in Ini(A, w_2)\}$

$\mathcal{A}(\varepsilon, \varepsilon) = \{A(q_i, q_f) : q_i, q_f \in K\}$.

Similarly, given a string $w$ in $\Sigma^*$, $\mathcal{A}(w)$ is the set of all local automata induced
by $w$. It is defined as $\mathcal{A}(w) = \{A(q_i, q_f) : (q_i, w, q_f) \in \Delta^*\}$ and in particular
$\mathcal{A}(\varepsilon) = \{A(q, q) : q \in K\}$ is a set of $|K|$ automata, one for each state in $K$.
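
The sets of delimiting states and the local automata can be computed directly from the transition relation. The following Python sketch assumes an ε-free nFA given by a set of states K and a set delta of triples (p, a, q); the function names reach, ini, fin, forward and local_automaton are hypothetical helpers introduced only for this illustration.

```python
def reach(delta, sources, w):
    """States q such that (p, w, q) is in Delta* for some p in sources."""
    current = set(sources)
    for a in w:
        current = {q for (p, b, q) in delta if p in current and b == a}
    return current


def ini(K, delta, w):
    """Ini(A, w): states from which the whole string w can be read."""
    return {q for q in K if reach(delta, {q}, w)}


def fin(K, delta, w):
    """Fin(A, w): states in which some reading of w can end."""
    return reach(delta, K, w)


def forward(delta, q):
    """States reachable from q by reading any string, including the empty one."""
    seen, todo = {q}, [q]
    while todo:
        p = todo.pop()
        for (x, _, y) in delta:
            if x == p and y not in seen:
                seen.add(y)
                todo.append(y)
    return seen


def local_automaton(K, delta, qi, qf):
    """A(qi, qf): the transitions of A lying on some path from qi to qf."""
    from_qi = forward(delta, qi)
    to_qf = {q for q in K if qf in forward(delta, q)}
    local_delta = {(p, a, q) for (p, a, q) in delta
                   if p in from_qi and q in to_qf}
    return local_delta, qi, {qf}
```

With the four-state automaton 0 -a-> 1 -b-> 2 -c-> 1 -d-> 3 for a(bc)*d, ini(K, delta, "a") is {0}, fin(K, delta, "a") is {1}, both sets equal K for the empty string, and local_automaton(K, delta, 1, 2) keeps only the two transitions of the loop between states 1 and 2.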



Figure 7: A perfect automaton (construction), for an automaton $A$ and the kernel string $w = a\, f_1\, f_2\, d$

Let $w(f_n)$ be a kernel string and $A$ be an nFA. The perfect automaton
w.r.t. $A$ and $w$ consists of several local automata suitably joined together by
$\varepsilon$-transitions. It is denoted by $\Omega(A, w)$ (or $\Omega$, when $A$ and $w$ are clear from the
context). Algorithm 1 describes how to build the perfect automaton
(assume that any pair of local automata have disjoint sets of states labeled as in
$A$), while Figure 7 shows the perfect automaton obtained from a given finite state
machine and a kernel string. We say that $A$ is compatible with $w$ if the set of all
(legal) local automata in $\Omega$ is not empty after the correction steps, or equivalently,
if there exists at least one sound typing. Moreover,

Algorithm 1 PerfectAutomaton(w, A)

1. Input: $w(f_n) = w_0 f_1 w_1 \ldots f_n w_n$, $A = (K, \Sigma, \Delta, s, F)$

2. Output: $\Omega(A, w) := \emptyset$

3. for each automaton $W \in \mathcal{A}(w_0)$ do
     ▷ add $W$ to $\Omega$

4. for each $i$ in [1..n] do
     ▷ for each automaton $X \in \mathcal{A}(w_{i-1}, w_i)$ do
        a. add $X$ to $\Omega$
        b. for each automaton $W \in \mathcal{A}(w_{i-1})$ do
           - if $label(q_{fin}(W)) = label(q_{ini}(X))$
             ■ add the transition $(q_{fin}(W), \varepsilon, q_{ini}(X))$ to $\Omega$
        c. for each automaton $W \in \mathcal{A}(w_i)$ do
           - add $W$ to $\Omega$
           - if $label(q_{fin}(X)) = label(q_{ini}(W))$
             ■ add the transition $(q_{fin}(X), \varepsilon, q_{ini}(W))$ to $\Omega$

// Correction steps:

5. for each automaton $W \in \mathcal{A}(w_0)$ do
     - if $label(q_{ini}(W)) \neq s$   // if $w_0 \neq \varepsilon$
       ■ remove $W$ from $\Omega$   // it is illegal

6. merge all automata in $\Omega$ being in $\mathcal{A}(w_0)$ according to their
   labels and use the (unique) initial state as initial state for $\Omega$

7. for each automaton $W \in \mathcal{A}(w_n)$ do
     - if $label(q_{fin}(W)) \in F$
       ■ $F(\Omega) = F(\Omega) \cup \{q_{fin}(W)\}$
     else   // if $w_n = \varepsilon$
       ■ remove $W$ from $\Omega$   // it is illegal

8. for each automaton $Z \in \Omega$ do
     - if (there is no path from $q_{ini}(\Omega)$ to $Z$ or
           there is no path from $Z$ to any final state of $\Omega$)
       ■ remove $Z$ from $\Omega$   // it is illegal



. $Seq(\Omega)$ denotes the set of all the sequences $W_0, X_1, W_1, \ldots, X_n, W_n$ of
connected automata in $\Omega$ such that: $W_0$ is an automaton in $\mathcal{A}(w_0)$, while $W_i$
and $X_i$ are, respectively, in $\mathcal{A}(w_i)$ and $\mathcal{A}(w_{i-1}, w_i)$ for any $i$ in [1..n];

. $Typ(\Omega) = \{(X_n) : W_0, X_1, W_1, \ldots, X_n, W_n \in Seq(\Omega)\}$ is the set containing
all different typings $(X_1, \ldots, X_n)$ from any sequence in $Seq(\Omega)$;

. $Aut(\Omega_i) = \{X_i : (X_1, \ldots, X_n) \in Typ(\Omega)\}$ is the set of all legal automata in
$\mathcal{A}(w_{i-1}, w_i)$;

. $\Omega_i = \bigcup Aut(\Omega_i)$ is the type obtained by the union of all automata in $Aut(\Omega_i)$;

. $(\Omega_n)$ is the typing for $w$ and $A$ obtained from $\Omega$.
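
The sets $Aut(\Omega_i)$ can also be obtained without materializing $\Omega$, by working on the state sequences $s, q_0, s_1, q_1, \ldots, s_n, q_n$ that delimit the strings $w_0, \ldots, w_n$ in $A$. The following Python sketch takes this shortcut, so it is not a transcription of Algorithm 1; it identifies each local automaton $A(q, s)$ with its pair of delimiting states. It assumes an ε-free nFA given as a set delta of triples (p, a, q), and the names read, closure and legal_pairs are hypothetical.

```python
def read(delta, sources, w):
    """States reachable from `sources` by reading exactly the string w."""
    current = set(sources)
    for a in w:
        current = {q for (p, b, q) in delta if p in current and b == a}
    return current


def closure(delta, sources):
    """States reachable from `sources` by reading any string at all."""
    seen, todo = set(sources), list(sources)
    while todo:
        p = todo.pop()
        for (x, _, q) in delta:
            if x == p and q not in seen:
                seen.add(q)
                todo.append(q)
    return seen


def legal_pairs(delta, start, finals, words):
    """For words = [w_0, ..., w_n], return a list whose (i-1)-th entry is the
    set of pairs (q, s) such that A(q, s) is a legal automaton for f_i."""
    n = len(words) - 1
    states = {p for (p, _, _) in delta} | {q for (_, _, q) in delta} | {start}
    # fwd[i]: states in which reading w_0 f_1 w_1 ... f_i w_i can end.
    fwd = [read(delta, {start}, words[0])]
    for i in range(1, n + 1):
        fwd.append(read(delta, closure(delta, fwd[i - 1]), words[i]))
    # bwd[i]: states from which w_i f_{i+1} ... f_n w_n can reach a final state.
    bwd = [None] * (n + 1)
    good_ends = set(finals)
    for i in range(n, 0, -1):
        bwd[i] = {s for s in states if read(delta, {s}, words[i]) & good_ends}
        good_ends = {q for q in states if closure(delta, {q}) & bwd[i]}
    return [{(q, s) for q in fwd[i - 1] for s in closure(delta, {q}) & bwd[i]}
            for i in range(1, n + 1)]
```

For the automaton 0 -a-> 1 -b-> 2 -c-> 1 -d-> 3 accepting a(bc)*d and the kernel string a f1 f2 d (so words = ["a", "", "d"]), the sketch returns the pairs (1, 1) and (1, 2) for f1, and (1, 1) and (2, 1) for f2. These delimit the languages (bc)*, (bc)*b and (bc)*, c(bc)* respectively.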

Let $(A_n)$ be a sequence of automata. We define the direct extension of $(A_n)$
as the set of strings $[(A_n)] = \{u_1 \cdots u_n \mid u_i \in [A_i] \text{ for each } i\}$.

Lemma 6.1. For any nFA $A$, $\Omega \le A$ holds. On the other hand, $A \le \Omega$
does not hold in general.

Proof. Given a string $u$ in $[\Omega]$, there exists a sequence $(\tau_{2n+1})$ of automata
in $Seq(\Omega)$ accepting $u$ and expressible as $A(s, q_0), A(q_0, s_1), A(s_1, q_1), \ldots,
A(q_{n-1}, s_n), A(s_n, q_n)$ for some states $q_0, s_1, q_1, \ldots, s_n, q_n$. Moreover, by
definition of direct extension, for each string $u_0\sigma_1u_1 \ldots \sigma_nu_n$ in $[(\tau_{2n+1})]$ we
have that $u_0 \in [A(s, q_0)]$, $\sigma_i \in [A(q_{i-1}, s_i)]$ and $u_i \in [A(s_i, q_i)]$, for each $i$ in
[1..n]. But, by definition of local automata, the following sequence of transitions
(each of which belongs to $\Delta^*$): $(s, u_0, q_0), (q_0, \sigma_1, s_1), (s_1, u_1, q_1), \ldots,
(q_{n-1}, \sigma_n, s_n), (s_n, u_n, q_n)$, where $q_n \in F$, is also derivable by $A$.

For the second part of the proof consider the string $w = a\,f\,c$ and the dRE
$abc + d$. □

Lemma 6.2. Let $w(f_n)$ be a string compatible with an nFA $A$. Any typing in
$Typ(\Omega)$ is sound for $w$ and $A$.

Proof. Given any typing $(X_n)$ in $Typ(\Omega)$, by definition, there is a sequence
$(\tau_{2n+1})$ of automata such that $X_i = \tau_{2i}$ for each $i$ in [1..n]. By Lemma 6.1,
$(\tau_{2n+1}) \le A$ holds. Moreover, by definition, the extension of $w(X_n)$ is
$[w(X_n)] = \{w_0\sigma_1w_1 \ldots \sigma_nw_n : \sigma_i \in [X_i],\ 1 \le i \le n\}$. Then $w(X_n) \le (\tau_{2n+1})$
as well, since all strings $w_0, \ldots, w_n$ are accepted by $\tau_1, \tau_3, \ldots, \tau_{2n+1}$, respectively,
by definition of local automata induced by a single string. Therefore,
$w(X_n) \le A$. □

Theorem 6.3. Let $w(f_n)$ be a kernel string compatible with a given nFA $A$, and
$(\tau_n)$ be a sound typing for them. Then, both $w(\tau_n) \le \Omega$ and $(\tau_n) \le (\Omega_n)$ hold.

Proof. Since $(\tau_n)$ is sound for $w$ and $A$, $w(\tau_n) \le A$ holds. In particular,
for each string $x = w_0\sigma_1w_1 \ldots \sigma_nw_n$ in $[w(\tau_n)]$, where each $\sigma_i \in [\tau_i]$,
there is a sequence of states $q_0, s_1, q_1, \ldots, s_n, q_n$ proving the membership of $x$
in $[A]$ by the following sequence of transitions: $(s, w_0, q_0) \in \Delta^*$, $(q_0, \sigma_1, s_1) \in
\Delta^*$, $(s_1, w_1, q_1) \in \Delta^*$, $\ldots$, $(q_{n-1}, \sigma_n, s_n) \in \Delta^*$, $(s_n, w_n, q_n) \in \Delta^*$, where $q_n \in F$
holds as well. But this means that the sequence $A(s, q_0), A(q_0, s_1), A(s_1, q_1),
\ldots, A(q_{n-1}, s_n), A(s_n, q_n)$ of automata belongs to $Seq(\Omega)$, so $w(\tau_n) \le \Omega$ holds.
Moreover, since each $A(q_{i-1}, s_i) \in Aut(\Omega_i)$, it follows that $\tau_i \le \Omega_i$ for each $i$,
that is $(\tau_n) \le (\Omega_n)$. □

Corollary 6.4. Let $w(f_n)$ be a kernel string compatible with a given nFA $A$,
and $(\tau_n)$ be a local typing for them. Then, $w(\tau_n) \equiv \Omega \equiv A$ holds.

Proof. By Lemma 6.1 and Theorem 6.3. □

Theorem 6.5. Let $w(f_n)$ be a kernel string and $A$ be an nFA compatible with
$w$. There is a perfect typing for $w$ and $A$ if and only if $w(\Omega_n) \equiv A$. If so, the
perfect typing is exactly $(\Omega_n)$.

Proof. ($\Rightarrow$) If there is a perfect typing for $w$ and $A$ then $w(\Omega_n) \equiv A$. If $w$ and
$A$ admit a perfect typing, say $(\tau_n)$, then (as it is also sound), by Theorem 6.3,
$(\tau_n) \le (\Omega_n)$. Suppose that $(\tau_n) < (\Omega_n)$ held. There would be (at least) an
$i$ in [1..n] such that $\tau_i < \Omega_i$. In other words, there would be an automaton
$\tau'_i \in Aut(\Omega_i)$ accepting some strings rejected by $\tau_i$. Consider the typing $(\tau'_n) \in
Typ(\Omega)$ containing $\tau'_i$ in position $i$. By Lemma 6.2, $(\tau'_n)$ is sound and then
$\tau'_i \le \tau_i$, by definition of perfect typing. But this is a contradiction. Therefore $(\tau_n) \equiv (\Omega_n)$ and
then $w(\Omega_n) \equiv A$, as $(\tau_n)$ is also local.

($\Leftarrow$) If $w(\Omega_n) \equiv A$ then there is a perfect typing for $w$ and $A$. This is true since
$(\Omega_n)$ is local and because, by Theorem 6.3, $(\tau_n) \le (\Omega_n)$ for any sound typing
$(\tau_n)$. □

The following two examples show that if there exists a local typing $(\tau_n)$ for
$w$ and $A$, then $(\tau_n) < (\Omega_n)$ might hold. This can happen even if $(\tau_n)$ is the unique
maximal local typing.

Example 9. Consider the string $w = a\, f_1\, c\, f_2\, e$, and the regular expression
$\tau = abccde$ compatible with $w$. Clearly, the typing $(b, cd)$ is local (sound and
complete) for $w$ and $\tau$ because $w(b, cd) \equiv \tau$. Nevertheless, $(\Omega_n) = (bc?, c?d)$
is (strictly) greater than $(b, cd)$ since $[bc?] = \{b, bc\} \supset \{b\}$ and $[c?d] = \{d,
cd\} \supset \{cd\}$. □

Example 10. Let $w = a\, f_1\, f_2\, d$ be a kernel string and $\tau$ be the regular expression
$a(bc)^*d$. Clearly, the typing $((bc)^*, (bc)^*)$ is local (also unique maximal
local but not perfect). But, as a consequence of the construction of the perfect
automaton, we have: $Aut(\Omega_1) = \{(bc)^*, (bc)^*b\}$ and $Aut(\Omega_2) = \{(bc)^*, c(bc)^*\}$.
Consequently, $\Omega_1 = ((bc)^*b?)$ and $\Omega_2 = (c?(bc)^*)$ do not represent a sound (and
hence local) typing since they allow strings such as $abccbcd$ or $abcbbcd$ that are
not accepted by $\tau$. □

The following example shows that even if there is no local typing for $w$ and
$\tau$, $\Omega \equiv \tau$ may hold.

Example 11. Let $\tau$ be the regular expression $ab + ba$ and $w = f_1f_2$. There
are two sound typings: $(a, b)$ and $(b, a)$, but there is no local typing. However,
$\Omega \equiv \tau$. □

We can now use the perfect automata construction to characterize the complexity
of PERF[nFA]. We use the next lemma:

Lemma 6.6. Let $w(f_n)$ be a kernel string and $A$ be a $k$-state nFA. The algorithm
for building the perfect automaton $\Omega(A, w)$ works in polynomial time.

Proof. Any set $\mathcal{A}(w_i)$ or $\mathcal{A}(w_{i-1}, w_i)$ contains at most $k^2$ automata, each of
which has size $O(k)$. Therefore, both the number of macro-iterations of the
algorithm and the size of $\Omega$ are polynomial in $n$ and $k$. For each $w_i$, the sets $Ini(A, w_i)$
and $Fin(A, w_i)$ can be obtained in nondeterministic logarithmic space (thus in
polynomial time) because for any pair of states $q_1, q_2$ in $A$, we check if the
string $w_i$ is in the language $[A(q_1, q_2)]$. Finally, all the automata in $\mathcal{A}(w_i)$ and
$\mathcal{A}(w_{i-1}, w_i)$ are nothing else but different copies of $A$ having different initial
and final states. □



Now, we have:

Theorem 6.7. PERF[nFA] is in PSPACE. So it is also PSPACE-complete by
Theorem 5.2.

Proof. Let $w(f_n)$ be a kernel string, $\tau$ be an nFA, and $(\tau_n)$ be a typing. Construct
the perfect automaton $\Omega(\tau, w)$. By Lemma 6.6, $\Omega$ can be built in polynomial
time w.r.t. $|\tau| + \|w\|$. Then, check in polynomial space if $w(\Omega_n) \equiv \tau \equiv w(\tau_n)$. □

And w.r.t. finding a perfect typing (if it exists), we have:

Theorem 6.8. ∃-PERF[nFA] is in PSPACE. So it is also PSPACE-complete by
Theorem 5.4.

Proof. Let $(\tau, w(f_n))$ be a (string) design. Construct the perfect automaton
$\Omega(\tau, w)$. By Lemma 6.6, $\Omega$ can be built in polynomial time w.r.t. $|\tau| + \|w\|$.
Then, check if $w(\Omega_n) \equiv \tau$, which is feasible in polynomial space. □

6.1. Additional properties 

We now show how to exploit perfect automaton properties to find (maximal)
sound typings when a design does not allow any perfect one. Clearly, this technique
can be used for seeking (maximal) local typings as well. Let $w(f_n)$ be a kernel
string and $A$ be an nFA-type compatible with $w$. All the automata belonging
to $Aut(\Omega_i)$ can be decomposed into at most $2^{|Aut(\Omega_i)|} - 1$ different automata such
that no two of them accept the same string. This new
set is denoted by $Dec(\Omega_i)$ and defined as follows:

$Dec(\Omega_i) = \{\bigcap \mathcal{A}_1 - \bigcup \mathcal{A}_2 : \emptyset \neq \mathcal{A}_1 \subseteq Aut(\Omega_i),\ \mathcal{A}_2 = Aut(\Omega_i) - \mathcal{A}_1\}$

An example for three automata is given in Figure 8. Finally, $Dec(\Omega) = \{(D_1, \ldots, D_n) :
D_i \in Dec(\Omega_i)\}$ is the set of all different typings from $Dec(\Omega_1) \times \ldots \times Dec(\Omega_n)$.
Given a typing $(\tau_n)$, we say that $(\tau_n) \in Dec(\Omega)$ if there exists a sequence
$(D_n) \in Dec(\Omega)$ such that $\tau_i \equiv D_i$, for each $i$.
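
To make the decomposition concrete, the following Python sketch computes it for finite languages, each given explicitly as a set of strings. This is a simplifying assumption: the text works with automata, whose intersections, unions and differences would be built by automata constructions instead. The function name decompose is hypothetical, and cells that turn out to be empty are dropped.

```python
from itertools import combinations


def decompose(languages):
    """Return the nonempty cells (intersection of a nonempty choice of the
    languages) minus (union of the remaining ones), as in Dec(Omega_i)."""
    cells = []
    indices = range(len(languages))
    for size in range(1, len(languages) + 1):
        for chosen in combinations(indices, size):
            # Intersection of the chosen languages ...
            cell = set(languages[chosen[0]])
            for i in chosen[1:]:
                cell &= languages[i]
            # ... minus the union of all the others.
            for i in indices:
                if i not in chosen:
                    cell -= languages[i]
            if cell:
                cells.append(cell)
    return cells
```

For three languages in general position, such as {a, ab, ac, abc}, {b, ab, bc, abc} and {c, ac, bc, abc}, the result has the seven pairwise disjoint cells of a three-set Venn diagram, and their union is the union of the three languages.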

Given a type $\tau \le \Omega_i$ for some $i$ in [1..n], $Dec(\tau, i) = \{\tau \cap \tau' : \tau' \in Dec(\Omega_i)\}$
denotes the partition of $\tau$, namely $\bigcup Dec(\tau, i) \equiv \tau$, obtained by its projection on
$Dec(\Omega_i)$. Let $(\tau_n)$ be any typing for a kernel string $w(f_n)$. Given a string $u \in \Sigma^*$
and an $i$ in [1..n], $(\tau_n)_{[\tau_i|u]}$ denotes the new typing obtained from $(\tau_n)$ by
replacing $\tau_i$ with the minimal dFA accepting only the string $u$. In particular,
$[w(\tau_n)_{[\tau_i|u]}]$ is defined as $\{w_0\sigma_1w_1 \ldots \sigma_nw_n : \sigma_i = u,\ \sigma_j \in [\tau_j]\ \forall j \neq i\}$ and
clearly,

$w(\tau_n) = \bigcup_{u \in [\tau_i]} w(\tau_n)_{[\tau_i|u]}$

Figure 8: Partitioning of (three) sets and enumeration of the parts: 001 = $A - (B \cup C)$, 010 = $B - (A \cup C)$, 011 = $(A \cap B) - C$, 100 = $C - (A \cup B)$, 101 = $(A \cap C) - B$, 110 = $(B \cap C) - A$, 111 = $A \cap B \cap C$



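We now define an extension of $(\tau_n)$ as the new typing obtained from $(\tau_n)$ by
replacing $\tau_i$ with the new type $(\tau_i \cup \tau)$, and denoted by $(\tau_n)_{[\tau_i \cup \tau]}$. In particular,

$w(\tau_n)_{[\tau_i \cup \tau]} = \bigcup_{u \in [\tau_i \cup \tau]} w(\tau_n)_{[\tau_i|u]}$

Clearly, if $\tau \le \tau_i$, then $(\tau_n) \equiv (\tau_n)_{[\tau_i \cup \tau]}$. Otherwise $(\tau_n) < (\tau_n)_{[\tau_i \cup \tau]}$.

Definition 20. A type $\tau$ extends another type $\tau'$ if $[\tau] - [\tau'] \neq \emptyset$ holds. Moreover,
the extension is called partial or total depending on whether $[\tau] \cap [\tau'] \neq \emptyset$
or not, respectively. □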

Lemma 6.9. Let $D = (A, w)$ be an nFA-design, $(\tau_n)$ be a consistent sound
typing for $D$, and $\tau \in Dec(\Omega_i)$ be an nFA belonging to the decomposition of $\Omega_i$,
for some $i$ in [1..n]. If $\tau$ partially extends $\tau_i$, then the extension $(\tau_n)_{[\tau_i \cup \tau]}$ of
$(\tau_n)$ is still sound.

Proof. By Definition 20, $[\tau]$ contains at least a string that does not belong to
$[\tau_i]$ but also a string, say $u'$, accepted by both $\tau_i$ and $\tau$. In order to prove the
statement, we show that $w(\tau_n)_{[\tau_i|u]} \le \Omega$ holds for each $u \in [\tau] - [\tau_i]$ (recall that,
by Lemma 6.1, $\Omega \le A$).

Since, by Theorem 6.3, $\tau_i \le \Omega_i$, there is a nonempty set $\mathcal{A} \subseteq Aut(\Omega_i)$
containing all-and-only the automata accepting $u'$. Clearly, since $\tau \in Dec(\Omega_i)$
and $u' \in [\tau]$, $\tau$ is also in $Dec(\tau', i)$ for each $\tau' \in \mathcal{A}$. This means that each
string $u \in [\tau] - [\tau_i]$ is accepted by all-and-only the automata in $\mathcal{A}$ as well. By
Theorem 6.3, $w(\tau_n) \le \Omega$, and in particular $w(\tau_n)_{[\tau_i|u']} \le \Omega$, as $u' \in [\tau_i \cap \tau]$.
In other words, any string in $[w(\tau_n)_{[\tau_i|u']}]$ is accepted by (at least) a sequence
of automata in $Seq(\Omega)$. Finally, as both $u$ and $u'$ are recognized by all-and-only
the automata in $\mathcal{A}$, each string $w_0\sigma_1w_1 \ldots \sigma_nw_n$ in $[w(\tau_n)_{[\tau_i|u]}]$ (with
$\sigma_i = u$) has a twin in $[w(\tau_n)_{[\tau_i|u']}]$ (with $\sigma_i = u'$) and both of them are accepted
by exactly the same sequences in $Seq(\Omega)$. □

Theorem 6.10. Let $(\tau_n)$ be a maximal typing for a kernel string $w(f_n)$ and an
nFA $A$ compatible with $w$. Then for each $i$, $Dec(\tau_i, i) \subseteq Dec(\Omega_i)$.

Proof. Let $i$ be an index arbitrarily fixed in [1..n]. As $(\tau_n)$ is maximal then,
by definition, it is sound and, by Theorem 6.3, $\tau_i \le \Omega_i$. Let $D_i$ be a copy
of $Dec(\Omega_i)$. Then $\tau_i \le \bigcup D_i$. Remove now, from $D_i$, each automaton $\tau'$ (if
any) such that $[\tau'] \cap [\tau_i] = \emptyset$. Still, $\tau_i \le \bigcup D_i$ holds. Hence, consider the two
possible (and alternative) cases: (1) $\tau_i \equiv \bigcup D_i$, or (2) $\tau_i < \bigcup D_i$. In the first case
the theorem is already proved. In the latter case, there is (at least) an
automaton $\tau \in D_i$ that partially extends $\tau_i$, entailing the relation $(\tau_n) < (\tau_n)_{[\tau_i \cup \tau]}$.
But since $(\tau_n)_{[\tau_i \cup \tau]}$ is still sound (see Lemma 6.9), there is a contradiction
because $(\tau_n)$ is assumed to be maximal. □

We are now ready to prove a main result of the section. In our original
paper, we showed a 2-EXPSPACE upper bound for ∃-LOC[nFA] and ∃-ML[nFA].
This was improved to EXPSPACE in [31]. We present here an alternative proof
of that result using the previous decomposition.

Theorem 6.11. Problems ∃-LOC[nFA] and ∃-ML[nFA] are in EXPSPACE.

Proof. By Lemma 6.9 and Theorem 6.10, if an nFA-design $D = (\tau, w)$ admits
a (maximal) local typing, say $(\tau_n)$, then for each $\tau_i$ there exists a subset of
$Dec(\Omega_i)$, say $D_i$, such that $\bigcup D_i \equiv \tau_i$.

Let $m$ be the number of states of $\tau$, and $v + n$ be the length of $w$ where
$n$ is clearly the number of functions and $v$ is the number of non-function
symbols in $w$. By definition, for each $i$ in [1..n], each automaton in $Aut(\Omega_i)$
has size at most $m$ and the cardinality of $Aut(\Omega_i)$ is at most $m^2$. Thus, the
cardinality of $Dec(\Omega_i)$ is no more than $2^{m^2}$, as well as the cardinality of $D_i$.
In the worst case, an automaton in $Dec(\Omega_i)$ is obtained as $\bigcap \mathcal{A}_1 - \bigcup \mathcal{A}_2$ where
both $|\mathcal{A}_1| = |\mathcal{A}_2| = O(m^2)$. So, the size of $\bigcap \mathcal{A}_1$ is no more than $m^{O(m^2)}$,
that is clearly lower than $2^{O(m^3)}$. The size of $\bigcup \mathcal{A}_2$ is at most $O(m^3)$. Now, for
computing $\bigcap \mathcal{A}_1 - \bigcup \mathcal{A}_2$ we perform the following intersection: $(\bigcap \mathcal{A}_1) \cap \overline{(\bigcup \mathcal{A}_2)}$. The
complement of $\bigcup \mathcal{A}_2$ may have $2^{O(m^3)}$ states [22]. Finally, $\bigcap \mathcal{A}_1 - \bigcup \mathcal{A}_2$ requires no
more than $2^{O(m^3)}$ states, and the size of $\bigcup D_i \equiv \tau_i$ is at most $2^{m^2} \cdot 2^{O(m^3)}$, being
clearly $2^{O(m^3)}$.

Now, we are ready for computing the size of the nFA $w(\tau_n)$. It is at most
$v + n \cdot 2^{O(m^3)}$. So, for deciding whether $w(\tau_n) \equiv \tau$ we need no more than
exponential space w.r.t. the input size $v + n + m$. The only problem we still
have is that we do not know a priori how to choose $D_i$. There are $2^{|Dec(\Omega_i)|}$ possible
subsets. But as NEXPSPACE = EXPSPACE (by Savitch's theorem), we can
simply guess each $D_i$.

For ∃-ML[nFA], we must find a maximal $D_i$. But in EXPSPACE we can still
guess the sequence $D_1, \ldots, D_n$ and prove (by Theorem 6.10), for each $D_i$, that
none of the automata in $Dec(\Omega_i) - D_i$ can be added to $D_i$ because the resulting
typing would lose its soundness. After the guess, the number of checks (each
of which may require exponential space) is at most $n \cdot 2^{m^2}$. □

7. Complexity for trees

Based on Theorem 6.11, we now obtain complexity bounds for the tree problems.
This completes results obtained in [31] and in earlier work on this topic. The next result
has appeared before. However, the sketch of proof originally given was not correct.
A proof was then presented in [31]. We next present a new proof based on
perfect automata.

Theorem 7.1. ML[nFA] is in PSPACE (so the problem is PSPACE-complete).

Proof. Let $D = (\tau, w)$ be an nFA-design, and $(\tau_n)$ be a $D$-consistent typing.
First of all, we check if $(\tau_n)$ is local (and we have already proved that LOC[nFA]
is doable in PSPACE). If so, then $\bar\tau \cap w(\tau_n) = \emptyset$ (where $\bar\tau$ is the nFA of possibly
exponential size accepting the complement of language $[\tau]$). Subsequently, we
check if $(\tau_n)$ is not maximal. In particular, by Lemma 6.9 and Theorem 6.10,
$(\tau_n)$ is not maximal if there is an nFA $A \in Dec(\Omega_i)$ for some $i$ in [1..n] such that
at least one of the following is true:

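. $A$ totally extends $\tau_i$ and $w(\tau_n)_{[\tau_i \cup A]}$ is still sound, namely $[A] \cap [\tau_i] = \emptyset$ and
$\bar\tau \cap w(\tau_n)_{[\tau_i \cup A]} = \emptyset$;

. $A$ partially extends $\tau_i$, namely $[A] - [\tau_i] \neq \emptyset$ and $[A] \cap [\tau_i] \neq \emptyset$;

So we proceed as follows:

1. Guess an index $i$, and a nonempty set of automata $\mathcal{A}_1 \subseteq Aut(\Omega_i)$;

2. Compute $\mathcal{A}_2 = Aut(\Omega_i) - \mathcal{A}_1$;

3. Let $A$ denote the automaton $\bigcap \mathcal{A}_1 - \bigcup \mathcal{A}_2$ (we do not really build it);

4. If $[A] \cap [\tau_i] = \emptyset$ then,

. if $\bar\tau \cap w(\tau_n)_{[\tau_i \cup A]} = \emptyset$, then $(\tau_n)$ is not maximal

. else if $[A] - [\tau_i] \neq \emptyset$ and $[A] \cap [\tau_i] \neq \emptyset$, then $(\tau_n)$ is not maximal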

Observe that even if automata such as $A$ or $\bar\tau$ may be exponential in size, we only use them
for intersection nonemptiness or intersection emptiness problems that are both
NL-complete problems [26]. Intuitively, we could avoid the materialization of
such automata with "on-the-fly" constructions. Hence, an NL algorithm on a
non-materialized (single) exponential automaton leads to PSPACE. More formally,
we consider alternating finite state machines, aFAs (for more details see
[13, 42]). We do not completely define them but we just recall what we need:

. Given an aFA $A$, deciding whether $[A] = \emptyset$ is PSPACE-complete;

. Any nFA is trivially a special kind of aFA;

. Given two aFAs $A$ and $A'$, a new aFA for $\bar A$, $A \cup A'$, and $A \cap A'$ can be
constructed in polynomial time and its size is linear.

Finally, we observe that all the above emptiness decisions deal with aFAs of
polynomial size and can be checked in PSPACE, as can the nonemptiness
decisions, since PSPACE is closed under complement. □
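
The idea of an "on-the-fly" construction can be illustrated on the simplest case, the intersection emptiness of two nFAs. In the following Python sketch the product automaton is never built: pairs of states are generated only when they are reached. The sketch still stores the set of visited pairs, so it does not achieve the space bound of the argument above, which would guess a path and keep only the current pair. The triple representation (delta, start, finals) of an ε-free nFA and the name intersection_empty are assumptions of this illustration.

```python
from collections import deque


def intersection_empty(A, B):
    """Decide whether [A] and [B] are disjoint by exploring reachable
    pairs of states lazily, without materializing the product automaton."""
    (dA, sA, fA), (dB, sB, fB) = A, B
    seen, queue = {(sA, sB)}, deque([(sA, sB)])
    while queue:
        p, q = queue.popleft()
        if p in fA and q in fB:
            return False  # a common string leads to this accepting pair
        for (x, a, p2) in dA:
            if x != p:
                continue
            for (y, b, q2) in dB:
                # Both automata must read the same symbol.
                if y == q and b == a and (p2, q2) not in seen:
                    seen.add((p2, q2))
                    queue.append((p2, q2))
    return True
```

For example, an automaton for (ab)* and one for a*b share the string ab, so their intersection is reported as nonempty, while automata for a+ and b+ are reported as disjoint.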

Now, we show how to reduce locality problems on boxes to locality problems
on strings.

Definition 21. Let $D = (\tau, B)$ be an $\mathcal{R}$-design where $B = B_0f_1B_1 \ldots f_nB_n$ is
a kernel box. Consider the $k$-th sequence of strings $w_0, \ldots, w_n$ (with $1 \le k \le
|B_0| * \ldots * |B_n|$) built from $B_0, \ldots, B_n$ by varying $w_i$ among the strings in $[B_i]$
(in some fixed order) for each $i \in [0..n]$. We denote by $D^k = (\tau, w^k)$ the $k$-th
$\mathcal{R}$-design built from $D$ where $w^k(f_n)$ is the kernel string $w_0f_1w_1 \ldots f_nw_n$. □

Lemma 7.2. Let $D = (\tau, B)$ be an $\mathcal{R}$-design and $(\tau_n)$ be a $D$-consistent typing.
We have that:

(1) If $(\tau_n)$ is local for $D$, then it is sound for each $D^k$;

(2) If $(\tau_n)$ is sound for each $D^k$, then it is sound for $D$ as well.

Proof. (1): If $(\tau_n)$ is local for $D$, then $B_0\tau_1B_1 \ldots \tau_nB_n \equiv \tau$. This means that
$w_0\tau_1w_1 \ldots \tau_nw_n \le \tau$ for each $w_i \in [B_i]$. Thus, $(\tau_n)$ is sound for each $D^k$.

(2): If for each design $D^k$ we have that $w_0\tau_1w_1 \ldots \tau_nw_n \le \tau$ holds, then
$B_0\tau_1B_1 \ldots \tau_nB_n \le \tau$ as well. □

A direct consequence of the above lemma is that if a typing is not sound
for some $D^k$, then it cannot be local for $D$. So a local typing candidate for $D$
is a typing that is sound for each $D^k$. Now suppose that $(\tau_n)$ is a maximal sound
typing for $D^{k_1}$ but it is not sound for $D^{k_2}$. This means that at least one $[\tau_i]$
contains some extra string such that $[w^{k_2}(\tau_n)]$ is not fully contained in $[\tau]$. So
we could remove such strings to obtain a typing sound for both $D^{k_1}$ and $D^{k_2}$ but
not maximal for $D^{k_1}$ any more. So we can guess a maximal sound typing for each
$D^k$ and then remove the exceeding strings. This is equivalent to keeping the
componentwise intersection of these maximal typings. Let $\beta = |B_0| * \cdots * |B_n|$;
we should build $\beta$ (it is an exponential number) perfect automata. For each $i$ in
[1..n] we should consider the sets of automata $Aut(\Omega_i^1), \ldots, Aut(\Omega_i^\beta)$ and from
these the respective decompositions $Dec(\Omega_i^1), \ldots, Dec(\Omega_i^\beta)$. Now we can guess
$\beta$ subsets $D_i^1, \ldots, D_i^\beta$ and finally compute $\tau_i$ as $(\bigcup D_i^1) \cap \ldots \cap (\bigcup D_i^\beta)$. But this is
equivalent to considering directly $Aut(\Omega_i) = Aut(\Omega_i^1) \cup \ldots \cup Aut(\Omega_i^\beta)$, computing the
decomposition $Dec(\Omega_i)$ and guessing a subset $D_i$ from $Dec(\Omega_i)$. This is much more
convenient because $Aut(\Omega_i)$ contains at most a quadratic number of automata
w.r.t. the states of $\tau$. Now, we show how to extend the construction of $\Omega$ to a
box-design for obtaining this new $Aut(\Omega_i)$. Let $A$ be an nFA and $B(f_n)$ a kernel
box, we have that:

. $Ini(A, B_i) = \{q_i \in K : \exists q_f \in K \text{ s.t. } (q_i, w, q_f) \in \Delta^*,\ w \in [B_i]\}$

. $Fin(A, B_i) = \{q_f \in K : \exists q_i \in K \text{ s.t. } (q_i, w, q_f) \in \Delta^*,\ w \in [B_i]\}$

. $\mathcal{A}(B_{i-1}, B_i) = \{A(q_i, q_f) : q_i \in Fin(A, B_{i-1}),\ q_f \in Ini(A, B_i)\}$

$Aut(\Omega_i)$ is the set of all legal automata in $\mathcal{A}(B_{i-1}, B_i)$ as for strings. Note
that, due to the structure of each $B_i$, it is very easy to build $Ini(A, B_i)$ and
$Fin(A, B_i)$ without enumerating all the strings in $[B_i]$.

Theorem 7.3. Problems ∃-LOC[nFA] and ∃-ML[nFA] on box-designs are in EXPSPACE.

Proof. Let $D = (\tau, B)$ be an nFA-design where $B$ is a kernel box. We guess,
for each $i$, a subset of automata in $Dec(\Omega_i)$, the decomposition of the new set
$Aut(\Omega_i)$ built as above. Then, we check if it is a (maximal) local typing for $D$
as done in the proof of Theorem 6.11. □

Corollary 7.4. ∃-LOC[nFA-EDTD] and ∃-ML[nFA-EDTD] are in 2-EXPSPACE.

Proof. Let $D = (\tau, T)$ be an nFA-EDTD-design. We build from $\tau$ its equivalent
normalized version $\tau'$ that, after all, is a dFA-EDTD of exponential size. So the
oracle machine discussed in Corollary 4.14 actually works in $\mathrm{NEXPTIME}^{C}$ where
$C$ is the complexity class of solving the ∃-LOC (or ∃-ML) problem on the resulting
box-designs. By Theorem 7.3, both of these problems are in EXPSPACE. Thus, the
whole algorithm works in 2-EXPSPACE. (Note that EXPSPACE is the best known
upper bound even for ∃-LOC[dFA].) □



The following analysis makes use of a technique introduced in [31] for building
the perfect automaton for dFA-designs.

Definition 22. Let $D = (\tau, B)$ be an $\mathcal{R}$-design where $B = B_0f_1B_1 \ldots f_nB_n$ is
a kernel box. Together with $D^k$ we consider the string design $\hat D^k$ defined as follows.
Let $\Sigma' = \Sigma \uplus \{\sigma_0, \ldots, \sigma_n\}$ be an extension of $\Sigma$ and $\sigma(f_n) = \sigma_0f_1\sigma_1 \ldots f_n\sigma_n$
be the kernel string built by combining the new symbols with the functions
of $B$. We denote by $\hat D^k = (\Omega^k, \sigma)$ the $k$-th dFA-design built from $D$ where
$\Omega^k$ is the perfect automaton for $\tau$ and $w^k$ built as described in [31]. □

The following lemma is a direct consequence of the definition of $\Omega^k$ in [31].

Lemma 7.5. A typing $(\tau_n)$ is sound for $D^k$ iff it is sound for $\hat D^k$.

Theorem 7.6. Let $D = (\tau, B)$ be a dFA-design and $(\tau_n)$ be a $D$-consistent
typing. The following are equivalent:

(1) $(\tau_n)$ is perfect for $D$;

(2) $(\tau_n)$ is both local for $D$ and perfect for each $\hat D^k$.

Proof. (1) $\Rightarrow$ (2): If $(\tau_n)$ is perfect for $D$, then it is sound for each $D^k$, and
by Lemma 7.5 sound for $\hat D^k$ as well. Suppose that $(\tau_n)$ is not local for some
$\hat D^k$; then there is a string $\sigma_0u_1\sigma_1 \ldots u_n\sigma_n \in [\Omega^k]$ (all the strings have this form by
definition) not captured by $\sigma(\tau_n)$. By Lemma 7.5, the string $w_0u_1w_1 \ldots u_nw_n$
belongs to $[\tau]$ and, as $(\tau_n)$ is perfect, each $u_i \in [\tau_i]$: contradiction. Suppose
that $(\tau_n)$ is not perfect for some $\hat D^k$. There is a sound typing $(\tau'_n)$ for $\hat D^k$ not
contained in $(\tau_n)$, but by Lemma 7.5, $(\tau'_n)$ is also sound for $D^k$, so $w^k(\tau'_n) \le \tau$:
again a contradiction because $(\tau_n)$ is perfect.

(2) $\Rightarrow$ (1): If $(\tau_n)$ is perfect for each $\hat D^k$ then, by Lemma 7.5, it is sound for
each $D^k$. Suppose that it is not perfect for $D$; then there is a sound typing
$(\tau'_n)$ not contained in $(\tau_n)$ such that, for some $k$, $w^k(\tau'_n) \le [\tau]$ for the $k$-th string
$w_0, \ldots, w_n$. So $(\tau'_n)$ is sound for $D^k$ and also for $\hat D^k$. Contradiction. □

Lemma 7.7. $\exists$-PERF$_{[\text{dFA}]}$ for box-designs is in coNP.

Proof. Let $D = \langle \tau, B \rangle$ be a dFA-design where $B = B_0 f_1 B_1 \ldots f_n B_n$
is a kernel box. We can decide in NP whether $D$ does not admit any perfect typing by
performing the following steps:

1. Guess: four string-designs $\hat{D}^{k_1}$, $\hat{D}^{k_2}$, $\hat{D}^{k_3}$, and $\hat{D}^{k_4}$.

2. Check: answer "yes" ($D$ does not admit any perfect typing) if at least
one of the following holds:

(a) $\hat{D}^{k_1}$ does not admit any perfect typing;

(b) $\hat{D}^{k_2}$, $\hat{D}^{k_3}$ have different perfect typings;

(c) $\hat{D}^{k_4}$ admits a perfect typing, say $(\tau_n)$, but it is not local for $D$.

Each of the checks (a), (b), and the first part of (c) requires polynomial time [31]. For
the second part of check (c), we build $B(\tau_n)$ and prove that $B(\tau_n) < \tau$, namely
$B(\tau_n) \cap \overline{\tau} = \emptyset$. Notice that if the "yes" answer only depends on
step (c), this means that $(\tau_n)$ is sound for each $D^k$ and so it is not possible that
$B(\tau_n) > \tau$. Thus, as $\tau$ is a dFA, its complement has the same size and the
intersection emptiness can be done in polynomial time as well. □
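
The last step of the proof relies on a standard fact: for a deterministic automaton, complementation is free, so emptiness of the intersection with its complement reduces to a reachability search on a product. The following Python sketch illustrates this for plain string automata only; it is not the procedure of the paper, and the dictionary encoding and the name `included_in_dfa` are assumptions made for the illustration.

```python
from collections import deque


def included_in_dfa(nfa, dfa, alphabet):
    """Return True iff L(nfa) is a subset of L(dfa).

    nfa: {'start': set, 'final': set, 'delta': {(state, sym): set of states}}
    dfa: {'start': state, 'final': set, 'delta': {(state, sym): state}}
    A missing DFA transition is read as a move to a rejecting sink, so the
    DFA is implicitly complete and its complement just flips acceptance.
    """
    sink = object()  # implicit rejecting sink of the DFA
    start = [(q, dfa['start']) for q in nfa['start']]
    seen = set(start)
    queue = deque(start)
    while queue:
        q, d = queue.popleft()
        # Witness: some word is accepted by the NFA and rejected by the DFA.
        if q in nfa['final'] and (d is sink or d not in dfa['final']):
            return False
        for a in alphabet:
            d2 = sink if d is sink else dfa['delta'].get((d, a), sink)
            for q2 in nfa['delta'].get((q, a), ()):
                if (q2, d2) not in seen:
                    seen.add((q2, d2))
                    queue.append((q2, d2))
    return True


# DFA for a b*, NFA for the single word ab, NFA for the language {a, b}.
dfa_ab_star = {'start': 0, 'final': {1}, 'delta': {(0, 'a'): 1, (1, 'b'): 1}}
nfa_ab = {'start': {0}, 'final': {2}, 'delta': {(0, 'a'): {1}, (1, 'b'): {2}}}
nfa_a_or_b = {'start': {0}, 'final': {1}, 'delta': {(0, 'a'): {1}, (0, 'b'): {1}}}

print(included_in_dfa(nfa_ab, dfa_ab_star, "ab"))
print(included_in_dfa(nfa_a_or_b, dfa_ab_star, "ab"))
```

The search visits at most $|Q_N| \cdot (|Q_D| + 1)$ pairs, which is where the polynomial bound comes from; with a nondeterministic right-hand side the complement could be exponentially larger.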

Corollary 7.8. $\exists$-PERF$_{[\text{nFA-EDTD}]}$ is in coNEXPTIME.

Proof. Let $D = \langle \tau, T \rangle$ be an nFA-EDTD-design. We build from $\tau$ its
equivalent normalized version $\tau'$ that, after all, is a dFA-EDTD of exponential size.
By Corollary 4.16, we polynomial-time reduce $\exists$-PERF$_{[\mathcal{R}\text{-EDTD}]}$ (for
normalized $\mathcal{R}$-EDTDs) to $\exists$-PERF$_{[\mathcal{R}]}$ for box-designs. So, in our
case, we call $\exists$-PERF$_{[\text{dFA}]}$. But, as $\tau'$ may be exponentially larger, by
adapting the upper bound of Lemma 7.7 the whole algorithm works in coNEXPTIME. □

Theorem 7.9. PERF$_{[\text{nFA-EDTD}]}$ is in coNEXPTIME.

Proof. Let $D = \langle \tau, T \rangle$ be an nFA-EDTD-design, and $(\tau_n)$ be a $D$-consistent
typing. Compute in coNEXPTIME a perfect typing $(\tau'_n)$ if there is one. Transform
$(\tau_n)$ into a dFA-EDTD-typing of exponential size. As EQUIV$_{[\text{dUTAs}]}$ is in PTIME,
we can decide in EXPTIME whether $(\tau_n)$ and $(\tau'_n)$ are equivalent. □

Unfortunately, for ML$_{[\text{nFA-EDTD}]}$ we do not have any good algorithm. Let
$D = \langle \tau, T \rangle$ be an nFA-EDTD-design, $(\tau_n)$ be a maximal local typing for $D$,
and $\kappa$ be the function induced by $(\tau_n)$ and $T$. At the moment, we do not even know
whether there could be a (non-maximal) local typing $(\tau'_n) < (\tau_n)$ such that
$\kappa' < \kappa$. If there is none, given a local typing $(\tau_n)$ and its induced function
$\kappa$, then each maximal local typing that extends $(\tau_n)$ has to induce the same $\kappa$
as well. So we could compare the various typings with $(\tau_n)$. But the only known
upper bound is given by the following theorem.

Theorem 7.10. ML$_{[\text{nFA-EDTD}]}$ is in 2-EXPSPACE.

Proof. Let $D = \langle \tau, T \rangle$ be an nFA-EDTD-design and $(\tau_n)$ be a $D$-consistent
typing. We can check whether it is not maximal. Check in EXPTIME whether it is local
or not. Then, build the normalized type $\tau'$ from $\tau$. Guess a function $\kappa$ and
check whether each induced box-design admits a local typing. This is in 2-EXPSPACE
by Corollary 7.4. Then, build the typing $(\tau'_n)$ induced by the box-designs. It may
be an nFA-EDTD typing exponentially larger. Check whether $(\tau_n) < (\tau'_n)$. This can
be done in 2-EXPTIME. So the algorithm works in 2-EXPSPACE and, as this class is
closed under complementation, we can also decide ML$_{[\text{nFA-EDTD}]}$ in it. □

Finally, we consider the reduction from trees to boxes for $\exists$-ML$_{[\text{dRE-EDTD}]}$.
The difficulties affecting ML$_{[\text{nFA-EDTD}]}$ (as we do not know whether there could be
a local typing $(\tau'_n) < (\tau_n)$ such that $\kappa' < \kappa$) also concern the existential
problem in the case of dREs.

Theorem 7.11. $\exists$-ML$_{[\text{dRE-EDTD}]}$ for normalized $\mathcal{R}$-EDTDs is decidable by
an oracle machine in PSPACE$^{C}$, where $C$ is the complexity class of solving the most
difficult problem between $\exists$-ML$_{[\text{dRE}]}$ and $\exists$-LOC$_{[\text{dRE}]}$.

Proof. In this case we have to check two sources of maximality, depending on the
choice of $\kappa$ and on the related box-designs. To do that, we guess a function $\kappa$
(the candidate for a maximal local typing) and we check whether each induced
box-design (i) admits a local typing, (ii) is maximal, and (iii) is dRE-definable.
Then, we have to prove that each $\kappa' > \kappa$ does not lead to any local typing. In
particular:

1. Guess a function $\kappa$;

2. Prove that, for each node $x$ of $T$ with $\mathrm{lab}(x) \in \Sigma$, the answer of
$\exists$-ML$_{[\text{dRE}]}$ over the box-design induced by $\kappa$ at $x$ is "yes";

3. Prove that, for each $\kappa' > \kappa$, there is at least a node $x$ of $T$ with
$\mathrm{lab}(x) \in \Sigma$ such that the answer of $\exists$-LOC$_{[\text{dRE}]}$ over the
box-design induced by $\kappa'$ at $x$ is "no".

We just notice that there could be an exponential number of $\kappa'$ to be enumerated
and checked, as well as an exponential number of calls to the oracle. □

8. Conclusion 

As explained in the introduction, this work can serve as a basis for designing 
the distribution of a document. It would be interesting to extend our definitions 
and methods to richer types of web data. First, this would involve graph data 
and not just tree data. Then one should consider unordered collections and 
functional dependencies as in the relational model [s^ . Other dependencies 
and in particular inclusion dependencies would also clearly make sense in this
setting [40]. More specific design methodology would also extend the techniques
presented in this paper by considering concrete network configurations; this is 
left for future research. 

Database design has a long history; see most database textbooks. Distributed
database design has also been studied since the early days of databases, but 
much less, because distributed data management was limited by the difficulty 
to deploy distributed databases. The techniques that were developed, e.g., ver- 
tical and horizontal partitioning, are very different from the ones presented here 
because we focus on ordered trees and collections are not ordered in relational
databases. We believe that traditional database studies, even on mainly theoretical
topics such as normal forms, are also relevant in a Web setting. An
interesting direction of research is to introduce some of these techniques in our 
setting. 

In the paper, the focus was on local typing that forces verification to be 
purely local. More generally, it would be interesting to consider typings of the 
resources that would minimize the communications needed for type checking 
(and not completely avoid them). Moreover, it would be interesting to analyze 
cases where a kernel document may change from time to time by adhering to 
some global type which uses function symbols in the specification itself. We 
are investigating this direction. Let us give a short example exhibiting some 
of the new difficulties that would arise in case kernel document changes were 
taken into account. Consider the kernel string $w = af$ and the type $\tau = af?ba^+$.
By directly applying the techniques proposed in this paper, it seems clear that
$f?ba^+$ would be the perfect typing for this design. So, one extension of $w$ may
be $afba$ (by attaching the tree $fba$ complying with the perfect typing) which, in
turn, represents a new kernel. But this extension might still be extended, by
attaching again the tree $fba$, to form $afbaba$, since the first extension still contains
a function call and the perfect typing defined for the remote resource should not
vary. This last step could be performed several times. The language obtained by
all possible extensions is defined by the type $af?(ba^+)^+$, being clearly different
from $\tau$. The problem here is that $\tau$ does not express directly a set of trees
without taking into account a specific typing. New interesting questions might 
be: How to look for typings that are, in a sense, fixpoints w.r.t. the original type 
with functions? or How to avoid irregularities? or even Is the perfect typing 
still unique? Finally, interesting issues may also come from studying the impact
of distributed typing (as studied here) on query optimization.
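
The kernel-extension example above can be replayed mechanically. The following Python sketch is only an illustration: it encodes the strings and types of the example as flat regular expressions, and the names `TAU`, `TYPING`, `ALL_EXT` and `extend` are hypothetical helpers that do not come from the paper.

```python
import re

# Flat string encodings of the types in the example; "f" is the function symbol.
TAU = re.compile(r"af?ba+")          # global type: a f? b a+
TYPING = re.compile(r"f?ba+")        # candidate perfect typing for f
ALL_EXT = re.compile(r"af?(ba+)+")   # claimed language of iterated extensions


def extend(kernel: str, subtree: str) -> str:
    """Replace the single function call f in `kernel` by `subtree`."""
    if kernel.count("f") != 1:
        raise ValueError("kernel must contain exactly one function call")
    if TYPING.fullmatch(subtree) is None:
        raise ValueError("subtree does not comply with the typing")
    return kernel.replace("f", subtree, 1)


w0 = "af"
w1 = extend(w0, "fba")   # first extension, itself a new kernel
w2 = extend(w1, "fba")   # second extension of that new kernel

for w in (w1, w2):
    print(w, "| in tau:", TAU.fullmatch(w) is not None,
          "| in af?(ba+)+:", ALL_EXT.fullmatch(w) is not None)
```

The first extension still belongs to the original type, while the second one leaves it and is only captured by the larger type, which is the irregularity discussed in the example.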